Understanding Inputs and Outputs

sophiaaredas commented 12 months ago

Hello, I am doing some analysis on my dataset and have been reading through the documentation, but I would like some clarification so I can make the best judgement.

For the input on the command line, I understand that the --splits parameter splits the search into chunks, but I am unsure what the example means by "--splits 8". I have tried different numbers and do not notice a difference in my results. Does the number 8 in this case refer to the threads being used by the computer?
In my summary output file (the virus summary file in my case), I see 7 predicted viral genes and a virus score of 0.9428, but the number of hallmarks is 0. The virus score looks promising (it is close to 1), but I am confused as to why the number of hallmarks would be zero if there are possibly 7 viral genes. I am currently using the default settings, and my next step would be to add the --conservative parameter, but I would like to hear your input. Thanks!

apcamargo commented 12 months ago

Hi @sophiaaredas,

You should not expect any difference in the results. That parameter only controls the memory usage of the profile search. If you split your search into multiple chunks it will use less memory, but will be a bit slower. If your computer can run geNomad without splitting the search, you shouldn't worry about it.
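To illustrate why chunking changes memory use but not results, here is a toy sketch in Python. It is not geNomad's code: the `score` and `search` functions, the sequences, and the choice to chunk the profile set are all assumptions made up for this example. The point is only that processing the search in 1 chunk or 8 chunks gives the same hits, while each chunk holds less data at once.

```python
# Toy illustration only, NOT geNomad's implementation.
# All names and sequences below are hypothetical.

def score(query: str, profile: str) -> int:
    """Hypothetical similarity: count of positions where the characters match."""
    return sum(a == b for a, b in zip(query, profile))


def search(queries, profiles, splits=1):
    """Return the best (score, profile) per query, visiting profiles one chunk at a time."""
    # Ceiling division so that at most `splits` chunks are produced.
    chunk_size = max(1, -(-len(profiles) // splits))
    best = {}
    for start in range(0, len(profiles), chunk_size):
        # Only this slice needs to be held "in memory" during the pass.
        chunk = profiles[start:start + chunk_size]
        for q in queries:
            for p in chunk:
                s = score(q, p)
                # Strict '>' keeps the earliest best hit, so ties resolve
                # the same way regardless of how the profiles are chunked.
                if q not in best or s > best[q][0]:
                    best[q] = (s, p)
    return best


queries = ["ATGCGT", "TTGACA"]
profiles = ["ATGCGA", "TTGACC", "GGGCCC", "ATGAAA",
            "TTTTTT", "CCGACA", "ATGCGT", "AAGACA"]

unsplit = search(queries, profiles, splits=1)
split_8 = search(queries, profiles, splits=8)
print(unsplit == split_8)
```

More chunks mean more passes (hence the small slowdown), but the final hits are identical, and the number of chunks is unrelated to the number of threads.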
What do you mean by 7 viral genes? Can you share this part of the output? In any case, you can have sequences classified as virus even when they don't encode a hallmark. One such case would be a sequence full of genes that are enriched in viruses, but whose function is not known.
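If it helps to find such cases in your own results, the sketch below lists sequences with a high virus score but no hallmarks. It assumes the virus summary is a tab-separated file with columns named `seq_name`, `n_genes`, `virus_score`, and `n_hallmarks`; check the header of your own file, since these names are an assumption here. The rows are made up for illustration (the first one mirrors the numbers in the question), and `no_hallmark_viruses` is a hypothetical helper, not part of geNomad.

```python
import csv
import io

# Made-up rows for illustration; the column names are assumed, not confirmed.
SAMPLE_SUMMARY = (
    "seq_name\tn_genes\tvirus_score\tn_hallmarks\n"
    "contig_1\t7\t0.9428\t0\n"
    "contig_2\t12\t0.9911\t3\n"
    "contig_3\t4\t0.7100\t0\n"
)


def no_hallmark_viruses(tsv_text, min_score=0.9):
    """Return names of sequences scoring >= min_score that have zero hallmarks."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["seq_name"]
        for row in reader
        if int(row["n_hallmarks"]) == 0 and float(row["virus_score"]) >= min_score
    ]


print(no_hallmark_viruses(SAMPLE_SUMMARY))
```

To run it on a real file, read the file's text with `open(path).read()` and pass it in place of `SAMPLE_SUMMARY`.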

sophiaaredas commented 11 months ago

Thank you so much, this makes sense now. I understand that my sequence encodes 7 predicted viral genes despite there being no hallmark. Much appreciated!

apcamargo / genomad

Understanding Inputs and Outputs #40