Arkadiy-Garber / FeGenie

HMM-based identification and categorization of iron genes and iron gene operons in genomes and metagenomes
GNU Affero General Public License v3.0

false negatives? #35

Open metalichen opened 2 years ago

metalichen commented 2 years ago

Hey! I do have another question!

After annotating my MAGs, I saw that FeGenie didn't find any transport-related clusters in any of them, which wouldn't make sense biologically (I have, among others, several cyanobacterial MAGs, and they must get their iron somewhere, right?). If I use the --all_results flag, I get some transport genes, but I'm not sure I should use them, since you mention in a different thread that this flag can create false positives.

I imagine something goes wrong during the clustering step? I looked into one MAG specifically. According to the output produced by --all_results, it has the three EfeUOB genes, all next to each other, but they don't show up when I run the same MAG in strict mode. Are there other genes that should be present for the cluster to be complete?

Sorry for the basic question, I'm very new to the iron metabolism world :)

I can send you the MAG I looked into, or the output files, if needed.

Thanks!

Arkadiy-Garber commented 2 years ago

Hi Gulya,

Your reasoning makes total sense. It does seem like something is going wrong at the clustering step. If the three EfeUOB genes are encoded next to each other, they should definitely be picked up by FeGenie (without the --all_results flag). Are you by chance running FeGenie with the --orfs or --gbk mode? And which MAG is it?

Welcome to the iron metabolism world :) it gets confusing at times, but everyone gets along and helps each other out. Plus, we have good coffee.

Arkadiy

metalichen commented 2 years ago

I was using --orfs (which, now that I think about it, would mean that FeGenie might not know that these genes are next to each other?). And I looked into private_T1916_metawrap_bin.6

Arkadiy-Garber commented 2 years ago

Thanks, Gulya. You are exactly correct. When you provide the --orfs flag, FeGenie skips the step where it clusters genes based on where they are encoded on the genome/contig. I need to make this clear in the README, or implement a way for FeGenie to infer gene order from the order in which ORFs are listed in the FASTA file. With the latter, though, there is potential to run into issues if the provided ORFs come from a highly fragmented assembly, since ORFs listed next to each other in the file may actually sit on different contigs.
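To make the "guess from FASTA order" idea concrete, here is a minimal sketch. It is not how FeGenie works today; the function names `read_orf_order` and `cluster_by_order` and the `max_distance` parameter are made up for illustration. It assumes the ORFs are listed in genomic order and treats hits whose positions in the file are close together as one candidate cluster.

```python
from typing import Dict, Iterable, List


def read_orf_order(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each ORF ID to its 0-based position among the FASTA headers.

    Assumption: ORFs appear in the file in the same order as on the genome.
    """
    order: Dict[str, int] = {}
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The ID is the first whitespace-delimited token of the header.
                order.setdefault(fields[0], len(order))
    return order


def cluster_by_order(
    order: Dict[str, int], hits: Iterable[str], max_distance: int = 1
) -> List[List[str]]:
    """Group hit ORFs whose file positions differ by at most max_distance.

    Hypothetical helper: contig boundaries are unknown here, so neighbours
    in the file that sit on different contigs would be merged incorrectly.
    """
    positions = sorted((order[h], h) for h in hits if h in order)
    clusters: List[List[tuple]] = []
    for pos, orf_id in positions:
        if clusters and pos - clusters[-1][-1][0] <= max_distance:
            clusters[-1].append((pos, orf_id))
        else:
            clusters.append([(pos, orf_id)])
    return [[orf_id for _, orf_id in cluster] for cluster in clusters]


if __name__ == "__main__":
    fasta = [f">ORF_{i:05d}" for i in range(1, 7)]
    hits = {"ORF_00002", "ORF_00003", "ORF_00004", "ORF_00006"}
    print(cluster_by_order(read_orf_order(fasta), hits))
```

With `max_distance=1`, only hits that are direct neighbours in the file are grouped, which is the situation described for the three EfeUOB genes.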

If you provide GenBank files, along with the --gbk flag, that should allow FeGenie to keep track of the relative positions of ORFs on each contig. Contigs are another potential input, but in that case FeGenie will run Prodigal and generate new gene calls.

From the MAGs that you emailed me, it seems that you annotated with Prokka? Prokka also uses Prodigal for ORF prediction, so the gene calls should be the same, just with different names. In any case, it wouldn't be very difficult to consolidate the two sets of ORFs.
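One way to consolidate the two sets, sketched below, is to match proteins by identical amino-acid sequence. This is only an illustration, not part of FeGenie: `parse_fasta` and `match_orfs` are hypothetical helpers, the IDs in the example are invented, and the sketch assumes both files hold protein sequences for the same assembly. Gene calls that differ between the two runs (for example, a different start codon) simply come back unmatched.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into {ORF ID: sequence}.

    The ID is the first token of each header; a trailing '*' (stop codon
    marker) is dropped so both sets compare cleanly.
    """
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else None
            if current is not None:
                chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {k: "".join(v).rstrip("*").upper() for k, v in chunks.items()}


def match_orfs(set_a: Dict[str, str], set_b: Dict[str, str]) -> Dict[str, List[str]]:
    """For every ORF in set_a, list the set_b ORFs with an identical sequence."""
    by_seq: Dict[str, List[str]] = {}
    for orf_id, seq in set_b.items():
        by_seq.setdefault(seq, []).append(orf_id)
    return {a_id: by_seq.get(seq, []) for a_id, seq in set_a.items()}


if __name__ == "__main__":
    prokka = [">PROKKA_00001 hypothetical protein", "MKT", "LLA*", ">PROKKA_00002", "MAAA"]
    prodigal = [">contig1_1 # 1 # 20", "MKTLLA", ">contig1_2", "MCCC"]
    print(match_orfs(parse_fasta(prokka), parse_fasta(prodigal)))
```

An empty list for an ORF flags a gene call present in one set but not the other, which is worth checking by hand before relying on the mapping.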

Let me know if you have any other questions, or if anything here doesn't make sense!