apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Different protein number from genomad and pyrodigal-gv #103

Closed by quliping 4 weeks ago

quliping commented 4 weeks ago

Hello, geNomad is good software and has been very helpful for my work. However, I ran into a strange problem. I tested geNomad v1.8.0 on a small data set of 1269 contigs. In the 'final_overlapped_virus_annotate' folder of geNomad's output, the 'final_overlapped_virus_proteins.faa' file contains 5880 proteins. However, running pyrodigal-gv from geNomad's conda environment on the same 1269 contigs gave me 5885 proteins. My command was 'pyrodigal-gv -p meta -i final_overlapped_virus.fasta -a final_overlapped_virus-pyrodigal-gv_single.faa -o pyrodigal-gv.out', where 'final_overlapped_virus.fasta' is the test data containing the 1269 contigs. I found 41 protein IDs that differ between the geNomad and pyrodigal-gv results. May I ask whether geNomad runs pyrodigal-gv with any special parameters?

Here is the test data containing the 1269 contigs: final_overlapped_virus.zip
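
For anyone who wants to reproduce the ID comparison described above, here is a minimal sketch. It assumes that each protein ID is the first whitespace-delimited token of its FASTA header line. The function names `read_protein_ids` and `compare_ids` are hypothetical helpers and are not part of geNomad or pyrodigal-gv.

```python
def read_protein_ids(faa_path):
    """Collect the record IDs (first token after '>') from a protein FASTA file."""
    ids = set()
    with open(faa_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:  # skip malformed headers with no ID
                    ids.add(tokens[0])
    return ids


def compare_ids(faa_a, faa_b):
    """Return (IDs only in faa_a, IDs only in faa_b), each as a sorted list."""
    ids_a = read_protein_ids(faa_a)
    ids_b = read_protein_ids(faa_b)
    return sorted(ids_a - ids_b), sorted(ids_b - ids_a)
```

Calling `compare_ids("final_overlapped_virus_proteins.faa", "final_overlapped_virus-pyrodigal-gv_single.faa")` lists the IDs unique to each file, which helps narrow down which contigs are affected.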

apcamargo commented 4 weeks ago

Thanks for sharing the data!

One possible cause for this discrepancy is that geNomad uses the mask option when performing gene prediction (see here). Another potential explanation is that if proviruses were detected in some of your sequences, the host-encoded genes within those proviral regions were removed, leading to a smaller number of predicted genes in the geNomad output.

quliping commented 4 weeks ago

Thanks for your kind reply. I ran pyrodigal-gv again with the '-m' option and got the same protein predictions as in the geNomad output.
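
The rerun can be scripted along these lines. This is only a sketch: it combines the flags from the original command in this thread with the '-m' option and assumes `pyrodigal-gv` is on the PATH. The helper names `build_pyrodigal_gv_command` and `run_pyrodigal_gv` are hypothetical, and the thread does not confirm whether geNomad passes any other options internally.

```python
import subprocess


def build_pyrodigal_gv_command(fasta, proteins_out, genes_out, mask=True):
    """Build the argument list for the pyrodigal-gv CLI.

    Uses only the flags that appear in this thread: -p, -i, -a, -o and -m.
    """
    cmd = ["pyrodigal-gv", "-p", "meta"]
    if mask:
        cmd.append("-m")  # the mask option mentioned in the reply above
    cmd += ["-i", fasta, "-a", proteins_out, "-o", genes_out]
    return cmd


def run_pyrodigal_gv(fasta, proteins_out, genes_out, mask=True):
    """Run pyrodigal-gv; raises CalledProcessError on a non-zero exit code."""
    cmd = build_pyrodigal_gv_command(fasta, proteins_out, genes_out, mask)
    subprocess.run(cmd, check=True)
```

For example, `run_pyrodigal_gv("final_overlapped_virus.fasta", "masked.faa", "masked.out")` would write the masked predictions to `masked.faa`, where the two output file names are placeholders.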