apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

decrease in the number of hits with the new version #18

Closed kayanac closed 1 year ago

kayanac commented 1 year ago

Hello,

Thank you for this great tool. I used both the old version (I don't remember the exact version number, but I used it in November) and the new version (1.5.0) of the tool on the same samples. The new version gives fewer hits (generally a 50% decrease) than the older version. I kept the virus_score at 0.7 for both versions. I am just wondering why there is such a dramatic change in the number of hits between the two versions. I am a bit confused.

Best,

Kadir

apcamargo commented 1 year ago

Hi @kayanac,

In version 1.5.0 I changed some internal search parameters to make sure that the search results stay the same regardless of the number of splits. I tuned the parameters to minimize changes in speed and sensitivity as much as possible. A 2-fold decrease in sensitivity is certainly strange. Can you share the input with me so I can check which proteins are not being detected with the new parameters? I can use that to change the defaults in a future release.

For now, you can increase the number of annotated proteins by raising the --sensitivity parameter (set to 4.2 by default). This will increase runtime, but should address your concern (the 1.5.0 release is faster than the previous version anyway).

kayanac commented 1 year ago

Thank you for your quick response. I set the sensitivity to 6.0 and ran one sample, but nothing changed. What should the sensitivity be? I can send you the input file, but the files are large and I cannot attach them to this entry.

Additional comment: All contigs in my samples were identified as viral by a combination of other viral identification tools (VirSorter, VirSorter2, VIBRANT, CheckV, and VirFinder). With the older version of geNomad, I had a classification ratio of over 80%, which is quite consistent with the results from the other viral identification tools. I am not sure whether I should use the results from the older version or the new version for my research.

Thanks.

apcamargo commented 1 year ago

Can you upload the file to something like https://transfer.sh/ or https://wetransfer.com/?

Without seeing your data, I'd say you can stick to the result of the previous version. The classification models are the same, the only change is the marker assignment step.

kayanac commented 1 year ago

Please use the following link for the contigs file: https://we.tl/t-ueaXx5EAmR . Please let me know what you think after you run the file.

Thanks,

Kadir

apcamargo commented 1 year ago

Thank you. I'll investigate and get back to you

apcamargo commented 1 year ago

@kayanac, I ran three different versions of geNomad using default parameters and these are the numbers I got. Given when you ran geNomad last year, I think you probably used version 1.2.0.

| geNomad version | MMseqs2 version | Notes | No. genes annotated |
|---|---|---|---|
| 1.2.0 | 13-45111 | Old MMseqs2 version | 99,531 |
| 1.4.0 | 14-7e284 | Sensitivity depends on --splits | 105,512 |
| 1.5.1 | 14-7e284 | | 105,609 |

As you can see, the sensitivity only increased. One thing that could have affected your results is the --splits parameter, since it affected the annotation prior to version 1.5.0. Do you remember whether you used --splits and, if so, which value you set it to?
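For anyone who wants to reproduce a count like "No. genes annotated" from their own runs, here is a minimal sketch. It assumes the `_genes.tsv` file is tab-separated with a `marker` column that holds `NA` for genes without a marker hit; check the header of your own file first, since that layout is an assumption here. The gene and marker names in the example are hypothetical.

```python
import csv
import io


def count_annotated_genes(handle, marker_column="marker", missing="NA"):
    """Count rows whose marker column is filled in.

    Assumption: the file is tab-separated, has a header row, and uses
    `missing` (or an empty string) for genes without a marker hit.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return sum(1 for row in reader if row[marker_column] not in ("", missing))


# Hypothetical three-gene table standing in for a real _genes.tsv file.
example = io.StringIO(
    "gene\tmarker\n"
    "contig_1_1\tMARKER_A\n"
    "contig_1_2\tNA\n"
    "contig_2_1\tMARKER_B\n"
)
n_annotated = count_annotated_genes(example)
print(n_annotated)
```

To compare two releases, run the function on the `_genes.tsv` file from each output directory (opened with `open(path, newline="")`) and compare the two counts.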

kayanac commented 1 year ago

Hi @apcamargo, I did not use --splits. This is the command line that I used:

genomad end-to-end --min-score 0.7 NFEBM11_viralcontigs.fasta genomad-test/ Databases/genomad/genomad_db/ .

What do you think the problem could be?

apcamargo commented 1 year ago

Can you compress the output and send it to me? From the previous version too, if you still have it

kayanac commented 1 year ago

Can I just send you the summary file? The file size is too big. Or which files would you like to have?

The output folder contains the following folders and files:

NFEPM11_viralcontigs_aggregated_classification
NFEPM11_viralcontigs_aggregated_classification.log
NFEPM11_viralcontigs_annotate
NFEPM11_viralcontigs_annotate.log
NFEPM11_viralcontigs_find_proviruses
NFEPM11_viralcontigs_find_proviruses.log
NFEPM11_viralcontigs_marker_classification
NFEPM11_viralcontigs_marker_classification.log
NFEPM11_viralcontigs_nn_classification
NFEPM11_viralcontigs_nn_classification.log
NFEPM11_viralcontigs_summary
NFEPM11_viralcontigs_summary.log

kayanac commented 1 year ago

I mean the output folder is too large.

apcamargo commented 1 year ago

NFEPM11_viralcontigs_annotate and NFEPM11_viralcontigs_marker_classification should be enough.

kayanac commented 1 year ago

Unfortunately, they are too large even after compression (> 1GB)

apcamargo commented 1 year ago

The NFEPM11_viralcontigs_marker_classification directory should be small. I need it to check how the classification results were affected.

As for NFEPM11_viralcontigs_annotate, you can send me just the _mmseqs2.tsv and the _genes.tsv files.

kayanac commented 1 year ago

newer_version_NFEPM11_viralcontigs_marker_classification.tar.gz newer_version_NFEPM11_viralcontigs_genes.tsv.tar.gz newer_version_NFEPM11_viralcontigs_mmseqs2.tsv.tar.gz older_version_NFEPM11_viralcontigs_marker_classification.tar.gz older_version_NFEPM11_viralcontigs_genes.tsv.tar.gz older_version_NFEPM11_viralcontigs_mmseqs2.tsv.tar.gz

apcamargo commented 1 year ago

Thanks! Can you also send the .json file inside NFEPM11_viralcontigs_annotate? The execution parameters are stored there.

kayanac commented 1 year ago

jsonfiles.tar.gz

apcamargo commented 1 year ago

Thanks for the data, @kayanac. It seems that the marker-based classification actually classifies more sequences as virus with the new version (38,659 versus 38,544). This makes sense, since newer releases tend to annotate more genes.

Your problem is probably downstream. Could you send me the files below?

Also, based on the json files you sent me, you actually used version 1.3.*.

apcamargo commented 1 year ago

Actually, I think I know what is going on. Most of your sequences are pretty short, and since version 1.4.0 geNomad requires sequences shorter than 2,500 bp to have at least one hallmark gene to be classified as virus or plasmid. To disable this behavior, just set --min-virus-hallmarks-short-seqs 0 --min-plasmid-hallmarks-short-seqs 0. You can read more about the filters that geNomad uses here.

Keep in mind that you should be careful with short sequences that don't encode any hallmark gene, as there's a bigger chance of them being misclassifications. You can use CheckV to do some QC.
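The short-sequence rule described above can be illustrated with a small sketch. This is not geNomad's actual implementation, only the logic as stated in this thread: sequences shorter than 2,500 bp need at least one hallmark gene, and setting the minimum to 0 disables the requirement. The `Contig` class, the function name, and the example contigs are hypothetical.

```python
from dataclasses import dataclass

# Threshold as described in the thread: sequences shorter than 2,500 bp.
SHORT_SEQ_THRESHOLD_BP = 2500


@dataclass
class Contig:
    name: str
    length: int       # sequence length in bp
    n_hallmarks: int  # number of hallmark genes found on the sequence


def passes_short_seq_filter(contig, min_hallmarks=1):
    """Return True if the contig survives the short-sequence hallmark filter.

    Sequences at or above the threshold are never affected; shorter ones
    must carry at least `min_hallmarks` hallmark genes.
    """
    if contig.length >= SHORT_SEQ_THRESHOLD_BP:
        return True
    return contig.n_hallmarks >= min_hallmarks


contigs = [
    Contig("contig_a", 1800, 0),  # short, no hallmark: dropped by default
    Contig("contig_b", 1800, 1),  # short, one hallmark: kept
    Contig("contig_c", 5200, 0),  # long: filter does not apply
]

kept_default = [c.name for c in contigs if passes_short_seq_filter(c)]
kept_disabled = [c.name for c in contigs if passes_short_seq_filter(c, min_hallmarks=0)]
print(kept_default)
print(kept_disabled)
```

With the default minimum of 1, only `contig_a` is removed. With the minimum set to 0, all three pass, which matches the effect of the two flags above on a dataset dominated by short contigs.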

kayanac commented 1 year ago

Thank you. I tried your recommendation and now it looks like it is fixed. I will double-check the results with CheckV.

apcamargo commented 1 year ago

Good to hear! If you want to identify as many viruses as possible, you can also use the --relaxed parameter, which will disable all the filters.

Let me know if you have any other questions :)
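To summarize the options discussed in this thread, here is a small sketch that assembles a `genomad end-to-end` command line from them. It only uses flags that appear above (`--min-score`, `--sensitivity`, `--min-virus-hallmarks-short-seqs`, `--min-plasmid-hallmarks-short-seqs`, `--relaxed`). The helper function and the file paths are hypothetical, and the script only prints the command rather than running it.

```python
import shlex


def build_genomad_command(input_fasta, output_dir, database_dir,
                          min_score=None, sensitivity=None,
                          disable_short_seq_hallmark_filter=False,
                          relaxed=False):
    """Assemble a `genomad end-to-end` argument list (hypothetical helper)."""
    cmd = ["genomad", "end-to-end"]
    if min_score is not None:
        cmd += ["--min-score", str(min_score)]
    if sensitivity is not None:
        cmd += ["--sensitivity", str(sensitivity)]
    if disable_short_seq_hallmark_filter:
        # Stop requiring a hallmark gene on short sequences.
        cmd += ["--min-virus-hallmarks-short-seqs", "0",
                "--min-plasmid-hallmarks-short-seqs", "0"]
    if relaxed:
        # Per the thread, --relaxed disables all the filters.
        cmd.append("--relaxed")
    cmd += [input_fasta, output_dir, database_dir]
    return cmd


# Placeholder paths; replace them with your own input, output, and database.
cmd = build_genomad_command(
    "contigs.fasta", "genomad_output", "genomad_db",
    min_score=0.7, disable_short_seq_hallmark_filter=True,
)
command_line = shlex.join(cmd)
print(command_line)
```

The resulting argument list can be passed to `subprocess.run(cmd, check=True)` once geNomad is installed and the paths point to real files.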