Arkadiy-Garber / FeGenie

HMM-based identification and categorization of iron genes and iron gene operons in genomes and metagenomes
GNU Affero General Public License v3.0

ValueError: invalid literal for int() with base 10: 'protein' #13

Open pdalcinmartins opened 4 years ago

pdalcinmartins commented 4 years ago

Dear Arkadiy,

I am getting the error below - could you help me understand what is going on and how to fix this?

Consolidating summary files into one master summary file
Identifying genomic proximities and putative operons
Traceback (most recent call last):
  File "/proj/pdmartins/FeGenie/FeGenie.py", line 2170, in <module>
    main()
  File "/proj/pdmartins/FeGenie/FeGenie.py", line 843, in main
    CoordDict[i][contig].append(int(numOrf))
ValueError: invalid literal for int() with base 10: 'protein'

I do get files such as FinalSummary.csv and .csv files for each category (i.e. iron_reduction-summary.csv, etc).

The commands I am running:

source activate fegenie
FeGenie.py -bin_dir /proj/pdmartins/2020_iron_reactor_analyses/analyses_2/ -bin_ext faa -t 16 -out /proj/pdmartins/2020_iron_reactor_analyses/analyses_2/fegenie --orfs --meta

The input file is a prodigal-generated and prokka-annotated fasta amino acid file. Example of sequence in this file:

unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase
MPDRITAGVDAVLDELLSERRLPDGLRGMMAYHLGWVDEDLRALPVRQRSKYGGKKMRAV
LCALACEAAGGDLETAFPAAAAVELVQNFSLVHDDIEDGDRERRHRPTVWVRWGVPQAIN
TGSAMQALVNAAVLRTPAPAETVLDVLRALTAAMVEMTEGQHLDIAFQDRTDVSVAEYED
MASRKTGALMEAAAYTGARLAMSDNRRLAAWRQFGRAFGQAFQARDDLLGVTGVPSVTGK
PVGNDIRARKKALPLLHALAHATPGDRVLLGRAFSNQAVSDEDVGRVTEVMERSGALDAT
RESVERATRSALEAFEATGALGPAADQIREMVSRAVGREQ

Here is the full slurm output file: slurm-835700.txt

Thank you very much for your help! Paula

Arkadiy-Garber commented 4 years ago

Hi Paula,

Thanks for bringing my attention to this error. I have been aware of this bug, and am currently considering possible fixes.

The header format in your FASTA files is currently not compatible with FeGenie. This is because FeGenie was designed to expect prodigal-formatted headers, where each header ends in an underscore followed by a number (e.g. contig_1, or contig_00001). FeGenie uses that number to keep track of where each ORF is encoded relative to other potential iron-related genes. If your header instead ends in a non-numeric value after the last underscore (such as 'protein'), FeGenie tries to convert that value to an integer and fails, causing the program to crash. Does this make sense?
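
To illustrate the failure, here is a minimal sketch of this kind of parsing. It is not FeGenie's actual code: the function name `orf_number` and the second example header are made up for illustration.

```python
def orf_number(header):
    """Return the integer after the last underscore of a prodigal-style header.

    Hypothetical helper for illustration; FeGenie's own parsing may differ.
    """
    _contig, num_orf = header.rsplit("_", 1)
    return int(num_orf)  # raises ValueError if the suffix is not numeric


print(orf_number("contig_00001"))  # 1

try:
    # Made-up header whose last underscore-delimited field is an annotation word.
    orf_number("NHGMMMNG_144129_hypothetical_protein")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: 'protein'
```

The second call reproduces the same kind of message as in your traceback, because the text after the last underscore is 'protein' rather than a number.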

One potential fix is to require users to make sure that their provided gene sequences have headers that are formatted this way. Would you be able to reformat your fasta amino acid file this way?
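
If it helps, reformatting could look something like the sketch below. It assumes that every header contains one underscore-delimited, all-digit field (the gene number) and that everything after it is annotation, as in the example header from this issue. It also assumes the ID has no whitespace-separated description after it. The names `trim_header` and `rewrite_fasta` are hypothetical, and this would give wrong results for contig names that themselves contain a numeric field (e.g. contig_2_12).

```python
import re

# Shortest prefix ending in "_<digits>", optionally followed by "_<annotation>".
HEADER_RE = re.compile(r"^(.+?_\d+)(?:_(.*))?$")


def trim_header(header):
    """Split a header into (id ending in _number, trailing annotation)."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"no underscore-delimited number in header: {header!r}")
    return match.group(1), match.group(2) or ""


def rewrite_fasta(lines):
    """Yield FASTA lines with each '>' header trimmed to end in _number."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            new_id, _annotation = trim_header(line[1:])
            yield ">" + new_id
        else:
            yield line


example = [
    ">unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase",
    "MPDRITAGVDAVLDELLSERRLPDGLRGMM",
]
for out_line in rewrite_fasta(example):
    print(out_line)
```

Keeping the annotation returned by `trim_header` in a separate table would let you map the FeGenie hits back to the prokka annotations afterwards.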

Another potential fix is to allow users to provide headers that are not formatted this way, but forego the relative genomic localization feature of FeGenie. In this case, however, you would see a lot of false positives in your output files. This is because FeGenie uses the known operon structures of iron-related pathways to rule out false positive hits to iron gene HMMs that may be part of broader gene families, but not necessarily involved in iron-related processes. Genomic context is an important component of FeGenie's identification of iron genes, so I hesitate to make this second option available.

Let me know if you have any thoughts on this, or other questions.

Thanks, Arkadiy

pdalcinmartins commented 4 years ago

Dear Arkadiy,

Thanks for your answer!

I can use contig fasta files and later figure out which genes FeGenie identified in my prokka-annotated amino acid fasta files - a bit more work, but I want to use the prokka annotations for this and other analyses (and later for genome submissions).

As for FeGenie, I see the importance of genomic context. Not sure how feasible this is, but one idea is to allow users to specify how contigs and genes are designated in their input files via additional command line options - for instance, -contigs letters -genes numbers, maybe with a few restrictions, for example, gene numbers have to be between underscores. And then FeGenie would ignore whatever comes next (i.e. an annotation).

Best, Paula

Arkadiy-Garber commented 4 years ago

Hi Paula,

That sounds like a reasonable approach (using contigs, and then figuring out which prokka ORFs correspond to the genes identified by FeGenie). That does sound like extra work, but it is definitely doable.

I also like the idea of having additional options that users can use to specify contig names. I tried coding this into the script, which I just uploaded to GitHub. Give it a try and see if it works for you. Assuming all headers are like the one you pasted in the above comment: "unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase", if you add the flag -contig_names NHGMMMNG, then it might be able to infer genomic context from the number following that (e.g. 144129). Does that make sense?
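
Roughly, the idea is something like the following sketch. This is only an illustration of the logic, not the code that is in the uploaded script. The function name `orf_number_after` is hypothetical, and it assumes the contig name appears as its own underscore-delimited field in the header.

```python
def orf_number_after(header, contig_name):
    """Return the integer in the field right after `contig_name`.

    Hypothetical illustration: fields are assumed to be separated by underscores.
    """
    fields = header.split("_")
    if contig_name not in fields:
        raise ValueError(f"{contig_name!r} not found in header {header!r}")
    position = fields.index(contig_name)
    if position + 1 >= len(fields):
        raise ValueError(f"nothing follows {contig_name!r} in header {header!r}")
    return int(fields[position + 1])  # ValueError if that field is not a number


header = "unbinned_NHGMMMNG_144129_2E-6E-farnesyl_diphosphate_synthase"
print(orf_number_after(header, "NHGMMMNG"))  # 144129
```

Anything after the number (i.e. the annotation) is simply ignored.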

One issue that I am foreseeing is that if you have the prokka-generated name (NHGMMMNG) assigned to all contigs, and the genes ordered sequentially, then you might have cases where two genes appear to be encoded adjacent to each other, but are actually at the ends of different contigs. I think if your assembly isn't too fragmented, then this might not be too much of an issue, as long as you are aware of this possibility, and confirm that the FeGenie-identified gene clusters are, indeed, all on the same contigs.
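
One way to do that check, assuming you can build a gene-to-contig lookup yourself (for example from an annotation table), is sketched below. The function name, the cluster IDs, the gene IDs and the contig names are all made up for illustration. None of this is part of FeGenie.

```python
def clusters_spanning_contigs(clusters, gene_to_contig):
    """Return {cluster_id: sorted contigs} for clusters on more than one contig.

    `clusters` maps a cluster ID to a list of gene IDs. `gene_to_contig` maps
    each gene ID to the contig it is encoded on. Both are hypothetical inputs.
    """
    flagged = {}
    for cluster_id, genes in clusters.items():
        contigs = {gene_to_contig[gene] for gene in genes}
        if len(contigs) > 1:
            flagged[cluster_id] = sorted(contigs)
    return flagged


# Made-up example: genes 144129 and 144130 look adjacent by number,
# but sit on different contigs.
gene_to_contig = {
    "NHGMMMNG_144128": "contig_7",
    "NHGMMMNG_144129": "contig_7",
    "NHGMMMNG_144130": "contig_8",
}
clusters = {
    "cluster_1": ["NHGMMMNG_144128", "NHGMMMNG_144129"],
    "cluster_2": ["NHGMMMNG_144129", "NHGMMMNG_144130"],
}
print(clusters_spanning_contigs(clusters, gene_to_contig))
# {'cluster_2': ['contig_7', 'contig_8']}
```

Any cluster that shows up in the result would need a closer look before being treated as a real operon.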

Let me know if you have any other issues or questions. I didn't have time to test the new FeGenie script that I just uploaded to GitHub, so if it generates an error, please let me know.

Thanks! Arkadiy