Closed lh12565 closed 1 week ago
By default, each run of biobloommaker
produces one filter (.bf
) per command. That is using a single fasta file should collapse it into a single filter. For example:
biobloommaker -p hsapiens Homo_sapiens_assembly38.fasta
Would produce a single filter. Extra work is needed if you are trying to split by fasta header. Are you running a script that splits is for you? Can you show me the commands executed?
Note the miBF executables (i.e. biobloommimaker
) can automatically split by header when producing a single multi-index Bloom filter but -F
can be used to split by a list of files. However, I should note that the 2018 publication you listed uses the normal Bloom filter code base.
By default, each run of
biobloommaker
produces one filter (.bf
) per command. That is using a single fasta file should collapse it into a single filter. For example:biobloommaker -p hsapiens Homo_sapiens_assembly38.fasta
Would produce a single filter. Extra work is needed if you are trying to split by fasta header. Are you running a script that splits is for you? Can you show me the commands executed?
Note the miBF executables (i.e.
biobloommimaker
) can automatically split by header when producing a single multi-index Bloom filter but-F
can be used to split by a list of files. However, I should note that the 2018 publication you listed uses the normal Bloom filter code base.
Hi @JustinChu I run the biobloommaker as follows, but there is a error:
biobloommaker -p hb hb.kc.fa
biobloommaker -p hbv hbv.kc.fa
biobloommimaker -t 10 -p human_vir -S "01111011100000011111101110000111100100000101001001011100011000010111101001100011 10101101001111010011100101110001111111111010010000110011001000100100011100001000 00010000111000100100010011001100001001011111111110001110100111001011110010110101 11000110010111101000011000111010010010100000100111100001110111111000000111011110" Homo_sapiens_assembly38.fasta
biobloommicategorizer -t 40 --fa -e -p ccle -f "human.bf hb.bf hbv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz
#Error: human.bf hb.bf hbv.bf File cannot be opened
biobloomcategorizer
and biobloommicategorizer
are different executables. For filters files created with biobloommaker
only use biobloomcategorizer
and for filters created with biobloommimaker
use biobloommicategorizer
.
Try something like
biobloommaker -p hb hb.kc.fa
biobloommaker -p hbv hbv.kc.fa
biobloommaker -t 10 -p human Homo_sapiens_assembly38.fasta
biobloomcategorizer -t 40 --fa -e -p ccle -f "human.bf hb.bf hbv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz
Try something like
biobloommaker -p hb hb.kc.fa biobloommaker -p hbv hbv.kc.fa biobloommaker -t 10 -p human Homo_sapiens_assembly38.fasta biobloomcategorizer -t 40 --fa -e -p ccle -f "human.bf hb.bf hbv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz
Hi, when I run
biobloommaker -t 10 -p human Homo_sapiens_assembly38.fasta
there is a warning:
Allocating 32490737280 bits of space for filter and will output filter this size (plus header)
Approximated (due to false positives) total unique k-mers in reference files 2388759546
Writing a 4061342160 byte filter to human.bf on disk.
The ratio between redundant k-mers and unique k-mers is approximately: 0.273768
Consider checking your files for duplicate sequences and adjusting them accordingly.
High redundancy will cause filter sizes used overestimated, potentially resulting in a larger than needed filter.
Alternatively you can set the number of elements wanted in the filter with (-n) and ignore this message.
Filter Creation Complete.
How to get a non-redundant k-mer of Homo_sapiens.fa? Thanks!
I wouldn't worry about the warning. It just means redundant k-mers are collapsed and the filter becomes more sparse with a lower FPR. It doesn't really cause issues other than a larger-than-needed filter. If memory is an issue you can set the size -n
to approximately the total number of unique k-mers.
Hi, @JustinChu I found if I used the k-mer fasta to run biobloomcategorizer, there is no hit in this fasta. But if I used the raw fasta, I got some hits. Are these hits credible?
What do you mean by k-mer fasta? At any rate BBT needs redundant hits to confidentiality classify, the intent is to use raw fasta I would expect.
The k-mer fasta is k-mer collection file for genomes, such as: http://opengene.org/viral.kc.fasta.gz
If I use the raw fasta, and run biobloomcategorizer, like:
biobloomcategorizer --fa -e -p s113368 -f "human.bf hb.bf hbv.bf hb7.bf ebv.bf" /NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz
The result is as follows:
The shared ratio of hb, hbv, hb7, ebv [virus filter] are very high. Is there any way to determine if there is a virus infection. If I use the RPM to determine virus infection as mentioned above, which >0.2 means virus infection said in the supplementary materials of the reference[1], all the viruses here are infected, especially hb7: 166116/732786286*10^6=226.6909 Thanks!
[1] Integrated Genomic Characterization of Papillary Thyroid Carcinoma
Were the genomes of viral.kc.fasta.gz used to create your viral filters? If you input the original file used to generate the filters you should get a match so that is a little strange. I'm not really sure why the viral.kc.fasta.gz file didn't work, it looks like a collection of viral genomes.
I think there is probably a lot of overlap with human sequences in your results maybe possibly low complexity sequence and there are also a lot of inactive endogenous viral sequences in the human genome. You can probably check by filtering by human first and pipe the output to another run of biobloomcategorizer
. If there are still a lot of hits then it might suggest an infection.
Something like this:
biobloomcategorizer --fa -e -p s113368_viral -f "hb.bf hbv.bf hb7.bf ebv.bf" <(biobloomcategorizer --fa -dn -e -p s113368_human -f human.bf/NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz)
Were the genomes of viral.kc.fasta.gz used to create your viral filters? If you input the original file used to generate the filters you should get a match so that is a little strange. I'm not really sure why the viral.kc.fasta.gz file didn't work, it looks like a collection of viral genomes.
I think there is probably a lot of overlap with human sequences in your results maybe possibly low complexity sequence and there are also a lot of inactive endogenous viral sequences in the human genome. You can probably check by filtering by human first and pipe the output to another run of
biobloomcategorizer
. If there are still a lot of hits then it might suggest an infection.Something like this:
biobloomcategorizer --fa -e -p s113368_viral -f "hb.bf hbv.bf hb7.bf ebv.bf" <(biobloomcategorizer --fa -dn -e -p s113368_human -f human.bf/NAS/lh/down/wgs/113368_1_1.wholegenome.fastq.gz /NAS/lh/down/wgs/113368_1_2.wholegenome.fastq.gz)
Hi @JustinChu How many hits or ratio can represent an infection? My result as below: s113368_human_summary.tsv
filter_id hits misses shared rate_hit rate_miss rate_shared
human 732786286 97880416 0 0.882166 0.117834 0
multiMatch 0 830666702 0 0 1 0
noMatch 97880416 732786286 0 0.117834 0.882166 0
s113368_viral_summary.tsv
filter_id hits misses shared rate_hit rate_miss rate_shared
hb 89 97880327 88 9.09273e-07 0.999999 8.99056e-07
hbv 54 97880362 53 5.51694e-07 0.999999 5.41477e-07
hb7 482 97879934 90 4.92438e-06 0.999995 9.19489e-07
ebv 0 97880416 0 0 1 0
multiMatch 90 97880326 0 9.19489e-07 0.999999 0
noMatch 97879932 484 0 0.999995 4.94481e-06 0
Thanks!
The results suggest hb7 is present and the others have shared hits due to sequence similarity to hb7. As the required coverage for infection is dependent on DNA extraction methods and viral load so I don't know what coverage makes sense. All I'm pretty sure of is that hb7 is present but you can look more closely with the small number of filtered sequence to be sure.
Hi, I want to check if there is any viral sequence in my human fastq. I run the BBT according to "Integrated Genomic Characterization of Papillary Thyroid Carcinoma" said. The abundance metric is calculated from the following formula (RPM): In the result file, for exmaple: Is NC001716's abundance metric (RPM) calculated like this: (96095/sum of chr1-M hits)*10^6 ? BTW, is it possible to combine human filters(chr1-22) into one filter? I just use the Homo_sapiens_assembly38.fasta to generated filter, so there are multiple chr filter_id there. Thanks!