donia-lab / MetaBGC

A metagenomic strategy for harnessing the chemical repertoire of the human microbiome
GNU General Public License v3.0

Isolating input files and different spHMMs provided? #6

Closed Manoo-hao closed 4 years ago

Manoo-hao commented 4 years ago

Thank you for the recent fix of the --prot_seq_directory issue. Excitingly, I can now get MetaBGC to run to completion under certain conditions. Under different conditions (detailed below) I still run into a number of issues along the way depending on how I pass the files to metabgc search.

Specifically, running the toy example you provide with the command

metabgc search --sphmm_directory ${OP_PATH}/path/to/HiPer_spHMMs --prot_family_name Cyclase_OxyN --cohort_name OxyN --nucl_seq_directory ${OP_PATH}/path/to/nucl_seq_dir --seq_fmt FASTA --pair_fmt interleaved --output_directory ${OP_PATH}/path/to/output --cpu 20

leads to:

ValueError: Duplicate key 'CP009321.1-2422/2'

I attached an error report below. Trying different conditions, e.g. subsampling the files in nucl_seq_dir, leads to the same error but with different keys. No single file in nucl_seq_dir contains CP009321.1-2422/2 (or the other reported keys) twice, so perhaps the same key is being added to a dictionary more than once? This seems to happen at the clustering step: I still receive the expected output files of the earlier steps in the MetaBGC workflow, but the bin_fasta directory and the abundance tables are missing.

[image: attached error report]

On the other hand, if I run the same command as above but with only one of the sample files in nucl_seq_dir, MetaBGC runs to completion without error. I have tried this with all 4 of the provided samples, i.e. I ran the same command 4 times from the same script, each time pointing at a separate nucl_seq_dir containing one of the sample files.

The other issues I ran into came up when comparing the 2 different spHMMs provided for Cyclase_OxyN: one from the toy example downloaded from your Google Drive, and the other from the MetaBGC/MetaBGC-V1/MetaBGC-Build_Outputs/ directory of this GitHub repository. Firstly, the F1_cutoff.txt file seems to be required as .tsv. Even after converting the .txt to .tsv, however, running metabgc search as described above on separate input directories leads to errors at the MetaBGC identify step. Error report attached below. The two F1_cutoff files have different numbers of columns. Could one of the columns that is present in the Drive version of the spHMM F1_cutoff file, but absent from the file in the GitHub repository, be required?

[image: attached error report]

If you would like any additional information regarding any of these issues, please don't hesitate to get in touch.

chuanshangtingfeng commented 4 years ago

I got the same error as Manoo-hao described above.

abiswas-odu commented 4 years ago

I am looking into #1.

Regarding #2, the F1_Cutoff.tsv file should be tab-separated and must have the 'interval' and 'cutoff' columns. The other columns don't matter; they are there for the user to study.
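A cutoff file could be checked against this requirement before starting a run. The sketch below is only an illustration using Python's standard library: the function name `check_cutoff_columns` and the sample rows are hypothetical and not part of MetaBGC. Only the 'interval' and 'cutoff' column names come from the comment above.

```python
import csv
import io

# Column names taken from the comment above; everything else is illustrative.
REQUIRED_COLUMNS = {"interval", "cutoff"}


def check_cutoff_columns(handle):
    """Return the set of required columns missing from a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return REQUIRED_COLUMNS - {name.strip() for name in header}


# Hypothetical file contents, for illustration only.
good = io.StringIO("interval\tcutoff\textra\n30_40\t25.0\tx\n")
bad = io.StringIO("interval,cutoff\n30_40,25.0\n")  # comma-separated, not tabs

print(check_cutoff_columns(good))  # empty set: nothing is missing
print(check_cutoff_columns(bad))   # both required columns are reported missing
```

With a real file, an open handle can be passed instead of the `io.StringIO` objects, for example `open(path, newline="")`.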

tamburinif commented 4 years ago

I am also encountering issue #1 with the provided test data.

abiswas-odu commented 4 years ago

The issue is that reads with the same name, e.g. CP009321.1-2422/2, are present in more than one of the synthetic files I had picked for the toy example. I had picked 4 random synthetic read files. I will append the sample name to the read sequence ID and update the toy dataset. I will also change the .txt extension for Cyclase_OxyN_F1_Cutoff to .tsv.
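To see which read IDs collide across samples before running a search, something like the following could be used. This is a minimal sketch, not MetaBGC code: the helper names `read_ids` and `find_shared_ids` and the sample names are made up, and it assumes FASTA input where the read ID is the first whitespace-delimited token after `>`.

```python
from collections import defaultdict


def read_ids(fasta_lines):
    """Yield the ID (first token after '>') of each FASTA record."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def find_shared_ids(samples):
    """Map each read ID that occurs more than once to the samples containing it.

    `samples` maps a sample name to an iterable of FASTA lines.
    """
    seen = defaultdict(list)
    for sample_name, lines in samples.items():
        for read_id in read_ids(lines):
            seen[read_id].append(sample_name)
    return {rid: names for rid, names in seen.items() if len(names) > 1}


# Hypothetical two-sample input; the sample names are made up.
samples = {
    "sample_A": [">CP009321.1-2422/2", "ACGT", ">read_only_in_A/1", "TTGA"],
    "sample_B": [">CP009321.1-2422/2", "GGCA"],
}
print(find_shared_ids(samples))
# prints {'CP009321.1-2422/2': ['sample_A', 'sample_B']}
```

For real data, the dictionary values can be open file handles rather than lists, since the functions only iterate over lines.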

abiswas-odu commented 4 years ago

Also, for a real run, the reads need to have unique IDs across all sample files. If needed, append the sample name to the read ID using sed.
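For anyone who prefers Python to sed, the renaming could look roughly like this. It is a sketch under stated assumptions: the function name `tag_read_ids` and the `_` separator are hypothetical, it handles FASTA only (FASTQ headers and quality lines would need different handling), and it puts the sample name in front of the ID so that a trailing /1 or /2 pair marker is left untouched. Whether MetaBGC expects the sample name before or after the ID is not stated in this thread.

```python
def tag_read_ids(fasta_lines, sample_name, sep="_"):
    """Yield FASTA lines with `sample_name` added to the front of each read ID.

    Assumption: input is FASTA, so every header line starts with '>'.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            yield f">{sample_name}{sep}{line[1:]}"
        else:
            yield line  # sequence lines pass through unchanged


# Hypothetical record and sample name, for illustration only.
records = [">CP009321.1-2422/2\n", "ACGT\n"]
print("".join(tag_read_ids(records, "sample_A")), end="")
# prints:
# >sample_A_CP009321.1-2422/2
# ACGT
```

Because the function is a generator over lines, it can be applied to an open input file and written straight to an output file without loading the whole sample into memory.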

abiswas-odu commented 4 years ago

I have uploaded an updated toy dataset that fixes this issue: https://drive.google.com/open?id=1-6nhr7WpWFbAxj89F-Q4fmB83nBhBDI-

abiswas-odu commented 4 years ago

Correct Link: https://drive.google.com/file/d/1-6nhr7WpWFbAxj89F-Q4fmB83nBhBDI-/view?usp=sharing