apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/
Other
168 stars 17 forks source link

enable-score-calibration option #38

Closed elozanoe closed 3 months ago

elozanoe commented 9 months ago

Hello. I've been reading the documentation and the use of --enable-score-calibration is not clear to me. When I apply it in the end-to-end pipeline, it tells me that I have less than 1000 sequences and that another option will be used by default if I do not apply an automatic option.

I am working with prokaryotic genome assemblies (less than 1000 contigs) and I would like to know the estimated probabilities. What is the most recommended option? Should I do genomad end-to-end without any tags?

Thanks in advanced!

apcamargo commented 9 months ago

Hi @elozanoe

Score calibration is blocked by default for samples with less than 1,000 sequences because the calibration works well when you have lots of sequences (see this figure) because the estimate of the underlying composition is more reliable.

How many genomes you have? My recommendation is to concatenate all the assemblies into a single FASTA to have as many contigs as possible in a single input. You can then use the --force-auto parameter to force score calibration regardless of the number of sequences. I've never benchmarked calibration with few sequences extensively, but my guess (based on the figure I linked above) is that it would still improve classification performance as long as you have at least 400-500 contigs. It might work well enought with even less than that, but I'd take a careful look at the results just to be sure.

Please let me know what you find out!

elozanoe commented 9 months ago

Thank you for the prompt response. I am working with 10 assemblies, but I would like, if possible, to keep them in separate FASTA files. I understand that the use of score calibration is more suitable for metagenomes, but that's not my case. I am trying to determine the presence of plasmids and/or phages in the assemblies, and I would like to assess the probabilities using this pipeline.

If I understand correctly, the --enable-score-calibration flag is more suitable when there is a large number of sequences in order to correct for the composition of a metagenome/virome. However, in the case of a genome, could the scores from aggregated classification be considered reliable?

apcamargo commented 9 months ago

The score calibration process can be used for isolates too, as long as you have enough sequences in your sample. You could concatenate the assemblies just for the geNomad classification, if that doesn't cause any problems downstream.

In case you decide to not do calibration, you can use the uncalibrated aggregated scores, no problem. They won't correspond to actual probabilities, though. They will just represent a measure of how confident the model is that your sequence is a chromosome/virus/plasmid.

elozanoe commented 9 months ago

Understood. Thank you so much!

apcamargo commented 9 months ago

Sure thing!

elozanoe commented 7 months ago

Hi!

I'm reopening this issue to discuss a test I conducted with assemblies of three isolates from different species: one with 23, another with 158, and another with 264 contigs. To simplify the message, we'll refer to them as A, B, and C, respectively.

In this test, I calculated the plasmid_score and virus_score for each of the isolates concatenated with GenBank assemblies of the same species, both with and without using the --enable-score-calibration flag. Subsequently, I calculated the scores when only the isolates were present, this time using the --enable-score-calibration flag, --enable-score-calibration along with --force-auto (since there are less than 1000 contigs), and without using any flags. This results in five groups of score data for both plasmids and viruses.

A. Scores of concatenated assemblies using the --enable-score-calibration flag. B. Scores of concatenated assemblies without calibration. C. Scores of isolates using the --enable-score-calibration flag. D. Scores of isolates using the --enable-score-calibration and --force-auto flags. E. Scores of isolates without calibration.

I'm attaching the results for each species:

In the case of species A, no plasmids or viruses were found in any scenario. The results are consistent with other resources such as PlasmidFinder and PHASTEST.

For species B, the results for plasmid_score are:

seq_name A B C D E 17 0.7184 0.7053 nd nd nd 25 0.9595 0.8871 0.8804 0.9795 0.8871 27 0.9993 0.9928 0.9991 0.999 0.9928 30 0.9993 0.9901 0.9991 0.9989 0.9901 31 0.8397 0.7557 nd nd nd 34 0.9991 0.9839 0.9987 0.9989 0.9839 38 0.9989 0.9753 0.998 0.9986 0.9753 39 0.9993 0.991 0.9991 0.999 0.9915 41 0.8862 0.7488 0.7265 0.8815 0.7488 42 0.999 0.979 0.9988 0.9984 0.979 46 0.9945 0.9406 0.9854 0.996 0.9406 48 0.9613 0.8735 0.8858 0.9774 0.8735 53 0.9992 0.9869 0.9989 0.9989 0.9869

The most significant difference is that when analyzing the isolates separately, sequences 17 and 31 are not detected. There are also minor differences in sequences 41 and 48. To quantify these variances, I subtracted the results of test A from both C and D. Ideally, the values should be close to 0. indicating that the score with isolates is reliable despite the low number of contigs. Calculating the average, in each case, I obtained 0.029 (A - C) and 0.004 (A - D). Therefore, the use of --force-auto (even using only --enable-score-calibration) yields good results.

For species B, the results for virus_score are similar:

seq_name A B C D E 45 nd nd 0.7439 0.8051 nd 57 0.9973 0.9602 0.9972 0.9989 0.9602 59 0.9826 0.9132 0.958 0.9925 0.9132 71 0.9877 0.9163 0.9861 0.9941 0.9163 72 0.989 0.9209 0.9865 0.9949 0.9209 84 0.999 0.9741 0.9983 0.9997 0.9741 86 0.9848 0.9243 0.9552 0.9939 0.9243 92 0.9466 0.8418 0.8655 0.9781 0.8418 93 0.9979 0.9609 0.9958 0.9993 0.9609 94 0.9984 0.9665 0.9974 0.9994 0.9665 108 0.9199 0.814 0.8032 0.9676 0.814 110 0.9983 0.9655 0.9973 0.9994 0.9655 116 0.9985 0.9671 0.9976 0.9994 0.9671 140 0.9973 0.957 0.9941 0.9991 0.957 145 0.9972 0.9568 0.9936 0.9991 0.9568 146 0.9973 0.9571 0.9937 0.9991 0.9571 11|provirus_65_16279 0.998 0.9675 0.9981 0.9992 0.9675 14|provirus_87_44881 0.9982 0.9666 0.9979 0.9993 0.9666 2|provirus_83538_96175 0.9988 0.9757 0.9986 0.9996 0.9757 24|provirus_66_14786 0.9933 0.9389 0.9948 0.9967 0.9389 33|provirus_5869_12592 0.9992 0.9793 0.9988 0.9997 0.9793 6|provirus_265_17197 0.9943 0.948 0.9968 0.9972 0.948 8|provirus_134573_141514 0.9987 0.9737 0.9985 0.9995 0.9737

In this case, interestingly, sequence 45 is detected only in isolates when calibrated. The overall results remain similar.

The results for plasmid_score and virus_score for species C are:

seq_name A B C D E 48 0.9501 0.9239 0.9458 0.9659 0.9239 55 0.999 0.9885 0.9989 0.9991 0.989 69 0.7037 0.8067 nd nd nd 75 0.9594 0.9214 0.9558 0.972 0.9214 108 0.9612 0.9241 0.9577 0.9734 0.9241 119 0.9486 0.9204 0.9441 0.9648 0.9204 121 0.713 0.7713 nd nd nd 124 0.9283 0.9091 0.9223 0.9511 0.9091 130 0.8209 0.7461 0.8191 0.8406 0.7461

seq_name A B C D E 2|provirus_90622_150516 0.9985 0.972 0.9984 0.9988 0.972 11|provirus_78642_111075 0.9982 0.9686 0.9981 0.9987 0.9686 22|provirus_50240_70641 0.9728 0.8687 0.9676 0.9819 0.8687

Based on the previous calculations, for this strain, it seems more favorable to use --enable-score-calibration without --force-auto.

Based on these tests, I've chosen to keep the isolates separate since the results are very similar to when the genomes are concatenated. The reason for this decision is my concern that introducing other assemblies may bias the probability estimation in one direction or another, leading to less reliable results. Could this happen?

I believe this would require a more in-depth analysis, but I present it as a starting point for other users who may have similar concerns. I hope this information is helpful. In my case, I decided to use only the --enable-score-calibration flag. Do you consider this correct?

Thanks in advance!

apcamargo commented 7 months ago

Thanks for the analysis!

Based on these tests, I've chosen to keep the isolates separate since the results are very similar to when the genomes are concatenated. The reason for this decision is my concern that introducing other assemblies may bias the probability estimation in one direction or another, leading to less reliable results. Could this happen?

Yes. The calibration will be affected by the sequences you concatenate with your assembly. Different sequences will result in a different calibration. In average, the classification performance will increase. But in your case, you're not interested in the average, only on a subset of the sequences (the ones from your assembly). So, I recommend using --enable-score-calibration, without concatenation.