dib-lab / genome-grist

map Illumina metagenomes to genomes!
https://dib-lab.github.io/genome-grist/

maximum number of gather hashes on report graph and finding numbers of hashes that don't match anything #266

Open jessicalumian opened 1 year ago

jessicalumian commented 1 year ago

Hello I have some questions!

  1. How can I modify the gather hashes vs mapped bp graph to show more than 60 genomes?
  2. How can I see the number of hashes that don't match anything in GTDB?
  3. Can I get the answers to 1 and 2 if I am using the GTDB database and providing another database in the same run?

Bonus question:

Is there a way to easily find out the amount of genome covered of a specific genome for different runs of genome-grist? Say I am looking for microbe X in five different microbiome samples and I want to know how many hashes match microbe X and what percentage of genome is covered in those samples. I imagine I could look at the report graphs but wondering if there's another way.

ctb commented 1 year ago

Hello I have some questions!

and I have answers!

1. How can I modify the gather hashes vs mapped bp graph to show more than 60 genomes?

The reports are generated from template notebooks in genome_grist/notebooks that are filled in and executed. The filled-in notebooks are available in outputs.*/reports/*.ipynb, and you can run them directly from there and modify them.

In this case you want report-mapping-{sample}.ipynb. You should be able to modify the number 60 at the top of it: see NUM=60.

If there are things we can do to make this notebook easier to edit let me know :). Haven't paid much attention to it in a while...
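If you would rather script the change than edit the notebook by hand, here is a minimal sketch. It assumes the cutoff appears in a code cell as a plain assignment like NUM=60 (as described above); the helper names and any file paths you pass in are hypothetical, not part of genome-grist. It only rewrites the number, so you still need to re-run the notebook in Jupyter afterwards.

```python
import json
import re

# Assumption: the cutoff is a plain assignment such as "NUM=60" in a code cell.
NUM_PATTERN = re.compile(r"^(\s*NUM\s*=\s*)\d+", flags=re.MULTILINE)


def set_num(notebook, new_value):
    """Rewrite NUM assignments in a loaded notebook dict (in place).

    Returns the number of assignments that were changed.
    """
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        # Notebook JSON stores cell source as one string or a list of lines.
        as_list = isinstance(source, list)
        text = "".join(source) if as_list else source
        text, n = NUM_PATTERN.subn(lambda m: f"{m.group(1)}{new_value}", text)
        if n:
            cell["source"] = text.splitlines(keepends=True) if as_list else text
            changed += n
    return changed


def rewrite_notebook(path_in, path_out, new_value):
    """Load a notebook, change NUM, and write the result to a new file."""
    with open(path_in, encoding="utf-8") as fh:
        notebook = json.load(fh)
    changed = set_num(notebook, new_value)
    with open(path_out, "w", encoding="utf-8") as fh:
        json.dump(notebook, fh, indent=1)
    return changed
```

Writing to a separate output path keeps the generated report untouched in case the pattern does not match what you expect; a return value of 0 means nothing was changed.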

2. How can I see the number of hashes that don't match anything in GTDB?

See outputs.*/{sample}.yaml. The unknown_hashes value is what you want. See also total_hashes and known_hashes.
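A minimal sketch for pulling those numbers out of the YAML file follows. It assumes the three keys sit at the top level of the file and hold plain numbers, which the thread does not spell out, and it relies on the third-party PyYAML package for loading. The function names are hypothetical.

```python
def summarize_hashes(info):
    """Summarize hash counts from an already-parsed sample info mapping.

    Assumption: total_hashes, known_hashes and unknown_hashes are
    top-level numeric keys.
    """
    total = info["total_hashes"]
    unknown = info["unknown_hashes"]
    return {
        "total_hashes": total,
        "known_hashes": info["known_hashes"],
        "unknown_hashes": unknown,
        # Fraction of hashes with no match; None if there are no hashes.
        "unknown_fraction": unknown / total if total else None,
    }


def load_sample_info(path):
    """Parse a {sample}.yaml file; requires PyYAML to be installed."""
    import yaml  # third-party, imported here so the summary works without it

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```

Typical use would be `summarize_hashes(load_sample_info(path))`, where `path` points at the {sample}.yaml file in your output directory.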

3. Can I get the answers to 1 and 2 if I am using the GTDB database and providing another database in the same run?

The numbers will be calculated with respect to the combined databases.

Bonus question:

Is there a way to easily find out the amount of genome covered of a specific genome for different runs of genome-grist? Say I am looking for microbe X in five different microbiome samples and I want to know how many hashes match microbe X and what percentage of genome is covered in those samples. I imagine I could look at the report graphs but wondering if there's another way.

hmm. ...yes... if I understand your question correctly...

outputs.*/gather/{sample}.gather.csv will contain the sourmash/hash information. You're looking for one of the columns f_orig_query, f_match, f_unique_to_query, f_unique_weighted, average_abund, median_abund, std_abund, f_match_orig, unique_intersect_bp for the row where name matches your desired microbe.

For the mapping coverage, look at outputs.*/mapping/{sample}.summary.csv. You're looking for f_covered_bp.
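To compare one microbe across several samples without opening each file, something like the sketch below could work. It uses only the gather column names mentioned above; the helper names, the glob pattern you pass in, and the default choice of columns are illustrative assumptions. Matching is a case-insensitive substring test on the name column, so check the hits by eye.

```python
import csv
import glob


def rows_matching(handle, needle, match_column="name"):
    """Return CSV rows whose match_column contains needle (case-insensitive)."""
    needle = needle.lower()
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if needle in (row.get(match_column) or "").lower()
    ]


def collect_gather(pattern, needle,
                   columns=("f_match", "f_unique_weighted", "unique_intersect_bp")):
    """Collect selected gather columns for one genome across many files.

    pattern is a glob such as "outputs.*/gather/*.gather.csv" (adjust to
    your own output directories). Results are keyed by file path.
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as fh:
            hits = rows_matching(fh, needle)
        results[path] = [
            {col: row.get(col) for col in ("name",) + tuple(columns)}
            for row in hits
        ]
    return results
```

The same `rows_matching` helper can be pointed at a {sample}.summary.csv file to read f_covered_bp, but you would need to set `match_column` to whichever column identifies the genome there, since that column name is not given in this thread.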

There are some details - like whether you want the stats for the metagenome x genome, or leftover metagenome x genome - but first I'd suggest that you go get confused by what's there and then come back and ask questions ;)

ctb commented 1 year ago

p.s. great questions!