Isolating input files and different spHMMs provided?

Manoo-hao commented 4 years ago

Thank you for the recent fix of the --prot_seq_directory issue. Excitingly, I can now get MetaBGC to run to completion under certain conditions. Under different conditions (detailed below) I still run into a number of issues along the way depending on how I pass the files to metabgc search.

Specifically, using the toy example you provide with the following command: metabgc search --sphmm_directory ${OP_PATH}/path/to/HiPer_spHMMs --prot_family_name Cyclase_OxyN --cohort_name OxyN --nucl_seq_directory ${OP_PATH}/path/to/nucl_seq_dir --seq_fmt FASTA --pair_fmt interleaved --output_directory ${OP_PATH}/path/to/output --cpu 20

leads to: ValueError: Duplicate key 'CP009321.1-2422/2'. I attached an error report below. Trying different conditions, e.g. subsampling the files in nucl_seq_dir, leads to the same error but with different keys. None of the files in nucl_seq_dir contains CP009321.1-2422/2 (or the other reported keys) twice, so perhaps the same key is being passed to a dictionary multiple times? This seems to happen at the clustering step: I still receive the expected output files of the earlier steps in the MetaBGC workflow, but the bin_fasta directory and the abundance tables are missing.
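Since each file looks clean on its own, one thing that may be worth ruling out is the same read ID appearing in more than one sample file. Below is a minimal sketch of such a check, using only the Python standard library. It assumes the read ID is the first whitespace-delimited token of each FASTA header and that the input files end in .fasta, .fa or .fna; both are my assumptions, and find_duplicate_ids is a hypothetical helper, not part of MetaBGC.

```python
from collections import defaultdict
from pathlib import Path


def fasta_ids(path):
    """Yield the ID (first whitespace-delimited token) of every FASTA header in `path`."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def find_duplicate_ids(directory, suffixes=(".fasta", ".fa", ".fna")):
    """Map each read ID seen more than once to the list of files it occurs in.

    The suffix list is an assumption; adjust it to match the actual inputs.
    """
    seen = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() in suffixes:
            for read_id in fasta_ids(path):
                seen[read_id].append(path.name)
    # Keep only IDs that occur more than once, within one file or across files.
    return {read_id: files for read_id, files in seen.items() if len(files) > 1}
```

Calling find_duplicate_ids("path/to/nucl_seq_dir") returns an empty dict if every ID is unique across the directory. A non-empty result would suggest the IDs collide between samples rather than within a single file.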

On the other hand, if I run the same command as above but with only one of the sample files in nucl_seq_dir, MetaBGC runs to completion without error. I have tried this with all 4 of the provided samples (i.e. I ran the same command 4 times from the same script, each time pointing at a separate nucl_seq_dir containing one of the sample files).

Other issues I ran into were associated with comparing the two provided spHMMs for Cyclase_OxyN: one from the toy example downloaded from your Google Drive, and the other from MetaBGC/MetaBGC-V1/MetaBGC-Build_Outputs/ in the repository on this GitHub page. Firstly, the F1_cutoff.txt file seems to be required as .tsv. However, even after converting the .txt to .tsv and running metabgc search as described above on separate input directories, I get errors at the MetaBGC identify step. Error report attached below. Looking at the respective F1_cutoff files, they have different numbers of columns. Could one of the columns that is present in the Drive version of the spHMM F1_cutoff file and absent from the file provided in the GitHub repository be required?
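To pin down which columns differ between the two F1_cutoff files, a small comparison like the sketch below might help. It assumes both files are tab-delimited and that their first row is a header; I have not confirmed either for these files. compare_headers is a hypothetical helper, not a MetaBGC function.

```python
import csv


def read_header(path, delimiter="\t"):
    """Return the first row of a delimited text file as a list of column names."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter=delimiter), [])


def compare_headers(path_a, path_b, delimiter="\t"):
    """Report the columns that appear in only one of the two files."""
    cols_a = read_header(path_a, delimiter)
    cols_b = read_header(path_b, delimiter)
    return {
        "only_in_a": [c for c in cols_a if c not in cols_b],
        "only_in_b": [c for c in cols_b if c not in cols_a],
    }
```

Passing the Drive version as path_a and the repository version as path_b would list any extra Drive columns under "only_in_a". If the files turn out to have no header row, comparing the length of the list returned by read_header for each file would at least confirm the difference in column count.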

If you would like any additional information regarding any of these issues, please don't hesitate to get in touch.