bcgsc / biobloom

Create Bloom filters for a given reference and then use it to categorize sequences
http://www.bcgsc.ca/platform/bioinfo/software/biobloomtools
GNU General Public License v3.0
75 stars 15 forks source link

Check virus from human fastq #81

Closed lh12565 closed 1 week ago

lh12565 commented 2 years ago

Hi, I want to check if there is any viral sequence in my human fastq. I run the BBT according to "Integrated Genomic Characterization of Papillary Thyroid Carcinoma" said. The abundance metric is calculated from the following formula (RPM): clipboard In the result file, for exmaple: virus Is NC001716's abundance metric (RPM) calculated like this: (96095/sum of chr1-M hits)*10^6 ? BTW, is it possible to combine human filters(chr1-22) into one filter? I just use the Homo_sapiens_assembly38.fasta to generated filter, so there are multiple chr filter_id there. Thanks!

JustinChu commented 2 years ago

By default, each run of biobloommaker produces one filter (.bf) per command. That is using a single fasta file should collapse it into a single filter. For example:

biobloommaker -p hsapiens Homo_sapiens_assembly38.fasta

Would produce a single filter. Extra work is needed if you are trying to split by fasta header. Are you running a script that splits is for you? Can you show me the commands executed?

Note the miBF executables (i.e. biobloommimaker) can automatically split by header when producing a single multi-index Bloom filter but -F can be used to split by a list of files. However, I should note that the 2018 publication you listed uses the normal Bloom filter code base.

lh12565 commented 2 years ago

By default, each run of biobloommaker produces one filter (.bf) per command. That is using a single fasta file should collapse it into a single filter. For example:

biobloommaker -p hsapiens Homo_sapiens_assembly38.fasta

Would produce a single filter. Extra work is needed if you are trying to split by fasta header. Are you running a script that splits is for you? Can you show me the commands executed?

Note the miBF executables (i.e. biobloommimaker) can automatically split by header when producing a single multi-index Bloom filter but -F can be used to split by a list of files. However, I should note that the 2018 publication you listed uses the normal Bloom filter code base.

Hi @JustinChu I run the biobloommaker as follows, but there is a error:

biobloommaker -p hb hb.kc.fa
biobloommaker -p hbv hbv.kc.fa
biobloommimaker -t 10 -p human_vir -S "01111011100000011111101110000111100100000101001001011100011000010111101001100011 10101101001111010011100101110001111111111010010000110011001000100100011100001000 00010000111000100100010011001100001001011111111110001110100111001011110010110101 11000110010111101000011000111010010010100000100111100001110111111000000111011110" Homo_sapiens_assembly38.fasta

biobloommicategorizer -t 40 --fa -e -p ccle -f "human.bf hb.bf hbv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz
#Error: human.bf hb.bf hbv.bf File cannot be opened
JustinChu commented 2 years ago

biobloomcategorizer and biobloommicategorizer are different executables. For filters files created with biobloommaker only use biobloomcategorizer and for filters created with biobloommimaker use biobloommicategorizer.

JustinChu commented 2 years ago

Try something like

biobloommaker -p hb hb.kc.fa
biobloommaker -p hbv hbv.kc.fa
biobloommaker -t 10 -p human Homo_sapiens_assembly38.fasta

biobloomcategorizer -t 40 --fa -e -p ccle -f "human.bf hb.bf hbv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz
lh12565 commented 2 years ago

Try something like

biobloommaker -p hb hb.kc.fa
biobloommaker -p hbv hbv.kc.fa
biobloommaker -t 10 -p human Homo_sapiens_assembly38.fasta

biobloomcategorizer -t 40 --fa -e -p ccle -f "human.bf hb.bf hbv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz

Hi, when I run

biobloommaker -t 10 -p human Homo_sapiens_assembly38.fasta

there is a warning:

Allocating 32490737280 bits of space for filter and will output filter this size (plus header)
Approximated (due to false positives) total unique k-mers in reference files 2388759546
Writing a 4061342160 byte filter to human.bf on disk.
The ratio between redundant k-mers and unique k-mers is approximately: 0.273768
Consider checking your files for duplicate sequences and adjusting them accordingly.
High redundancy will cause filter sizes used overestimated, potentially resulting in a larger than needed filter.
Alternatively you can set the number of elements wanted in the filter with (-n) and ignore this message.
Filter Creation Complete.

How to get a non-redundant k-mer of Homo_sapiens.fa? Thanks!

JustinChu commented 2 years ago

I wouldn't worry about the warning. It just means redundant k-mers are collapsed and the filter becomes more sparse with a lower FPR. It doesn't really cause issues other than a larger-than-needed filter. If memory is an issue you can set the size -n to approximately the total number of unique k-mers.

lh12565 commented 2 years ago

Hi, @JustinChu I found if I used the k-mer fasta to run biobloomcategorizer, there is no hit in this fasta. But if I used the raw fasta, I got some hits. Are these hits credible?

JustinChu commented 2 years ago

What do you mean by k-mer fasta? At any rate BBT needs redundant hits to confidentiality classify, the intent is to use raw fasta I would expect.

lh12565 commented 2 years ago

The k-mer fasta is k-mer collection file for genomes, such as: http://opengene.org/viral.kc.fasta.gz

If I use the raw fasta, and run biobloomcategorizer, like:

biobloomcategorizer --fa -e -p s113368 -f "human.bf hb.bf hbv.bf hb7.bf ebv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz

The result is as follows: filter

The shared ratio of hb, hbv, hb7, ebv [virus filter] are very high. Is there any way to determine if there is a virus infection. If I use the RPM to determine virus infection as mentioned above, which >0.2 means virus infection said in the supplementary materials of the reference[1], all the viruses here are infected, especially hb7: 166116/732786286*10^6=226.6909 Thanks!

[1] Integrated Genomic Characterization of Papillary Thyroid Carcinoma

JustinChu commented 2 years ago

Were the genomes of viral.kc.fasta.gz used to create your viral filters? If you input the original file used to generate the filters you should get a match so that is a little strange. I'm not really sure why the viral.kc.fasta.gz file didn't work, it looks like a collection of viral genomes.

I think there is probably a lot of overlap with human sequences in your results maybe possibly low complexity sequence and there are also a lot of inactive endogenous viral sequences in the human genome. You can probably check by filtering by human first and pipe the output to another run of biobloomcategorizer. If there are still a lot of hits then it might suggest an infection.

Something like this:

biobloomcategorizer --fa -e -p s113368_viral -f "hb.bf hbv.bf hb7.bf ebv.bf" <(biobloomcategorizer --fa -dn -e -p s113368_human -f human.bf/NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz)
lh12565 commented 2 years ago

Were the genomes of viral.kc.fasta.gz used to create your viral filters? If you input the original file used to generate the filters you should get a match so that is a little strange. I'm not really sure why the viral.kc.fasta.gz file didn't work, it looks like a collection of viral genomes.

I think there is probably a lot of overlap with human sequences in your results maybe possibly low complexity sequence and there are also a lot of inactive endogenous viral sequences in the human genome. You can probably check by filtering by human first and pipe the output to another run of biobloomcategorizer. If there are still a lot of hits then it might suggest an infection.

Something like this:

biobloomcategorizer --fa -e -p s113368_viral -f "hb.bf hbv.bf hb7.bf ebv.bf" <(biobloomcategorizer --fa -dn -e -p s113368_human -f human.bf/NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz)

Hi @JustinChu How many hits or ratio can represent an infection? My result as below: s113368_human_summary.tsv

filter_id       hits    misses  shared  rate_hit        rate_miss       rate_shared
human   732786286       97880416        0       0.882166        0.117834        0
multiMatch      0       830666702       0       0       1       0
noMatch 97880416        732786286       0       0.117834        0.882166        0

s113368_viral_summary.tsv

filter_id       hits    misses  shared  rate_hit        rate_miss       rate_shared
hb      89      97880327        88      9.09273e-07     0.999999        8.99056e-07
hbv     54      97880362        53      5.51694e-07     0.999999        5.41477e-07
hb7     482     97879934        90      4.92438e-06     0.999995        9.19489e-07
ebv     0       97880416        0       0       1       0
multiMatch      90      97880326        0       9.19489e-07     0.999999        0
noMatch 97879932        484     0       0.999995        4.94481e-06     0

Thanks!

JustinChu commented 2 years ago

The results suggest hb7 is present and the others have shared hits due to sequence similarity to hb7. As the required coverage for infection is dependent on DNA extraction methods and viral load so I don't know what coverage makes sense. All I'm pretty sure of is that hb7 is present but you can look more closely with the small number of filtered sequence to be sure.