Closed xiaoli-dong closed 9 months ago
Thanks @xiaoli-dong, I am also curious about accuracy with viral genomes, and I plan to investigate.
My expectation is that the default (human-t2t-hla
) reference genome will already perform well for viruses, but I will reply to this issue once I have made some progress. Thanks for reminding me.
Hi @xiaoli-dong, My testing shows that a small number of sequences (below 0.00% of viral RefSeq 150mers) are not retained when using the default human-t2t-hla reference. These almost exclusively belong to Herpesvirales.
To provide perfect RefSeq viral retention, I have made a new masked reference that masks 0.18% of the human genome. I will make this available in coming days and update this issue when ready.
Apologies for delay, I plan to publish virus masked databases imminently. In the process of building them I made improvements to the masking protocol, which have been committed to the main branch.
As well as viral RefSeq, I will be making available a database masked against the phage genome collection produced by members of Andrew Millard's lab: https://millardlab.org/phage-genomes-jan-2024
I have added two human genomes masked against viral and phage genomes (link to readme):
human-t2t-hla.rs-viral-202401_ml-phage-202401
(human masked against virus and phage)human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401
(human masked against bacteria, virus and phage)With the latest version of Hostile, you can use these by running e.g.
hostile clean --index human-t2t-hla.rs-viral-202401_ml-phage-202401 --fastq1 reads.fastq.gz
The first time you run this command, the index will be downloaded.
Feedback welcome
Hi @bede,
In case it's of use, I can provide some feedback on the performance of hostile
using the human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401
reference. My sample comprises paired-end reads where I know the majority of them will be from respiratory syncytial virus A (RSVA), a herpesvirus.
My initial human (host) decontamination was performed with bowtie2
like so:
INDEX=GRCh38
bowtie2 \
-x $INDEX \
-1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
--threads 12 \
--un-conc-gz sample.host_unmapped.fastq.gz \
--very-sensitive-local \
1> /dev/null \
2> >(tee R1399.bowtie2.log >&2)
mv sample.host_unmapped.fastq.1.gz sample.host_unmapped_1.fastq.gz
mv sample.host_unmapped.fastq.2.gz sample.host_unmapped_2.fastq.gz
Results:
255485 reads; of these:
255485 (100.00%) were paired; of these:
251866 (98.58%) aligned concordantly 0 times
1599 (0.63%) aligned concordantly exactly 1 time
2020 (0.79%) aligned concordantly >1 times
----
251866 pairs aligned concordantly 0 times; of these:
14 (0.01%) aligned discordantly 1 time
----
251852 pairs aligned 0 times concordantly or discordantly; of these:
503704 mates make up the pairs; of these:
503331 (99.93%) aligned 0 times
170 (0.03%) aligned exactly 1 time
203 (0.04%) aligned >1 times
1.49% overall alignment rate
I ran hostile
like so:
hostile clean --fastq1 R1399_trimmed_1.fastp.fastq.gz --fastq2 R1399_trimmed_2.fastp.fastq.gz --index human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 > log.json
Results:
[
{
"version": "1.1.0",
"aligner": "bowtie2",
"index": "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401",
"options": [],
"fastq1_in_name": "sample_R1.fastq.gz",
"fastq1_in_path": "/Users/charles/Programs/hostile/sample_R1.fastq.gz",
"fastq1_out_name": "sample_R1.clean_1.fastq.gz",
"fastq1_out_path": "/Users/charles/Programs/hostile/sample_R1.clean_1.fastq.gz",
"reads_in": 510970,
"reads_out": 504656,
"reads_removed": 6314,
"reads_removed_proportion": 0.01236,
"fastq2_in_name": "sample_R2.fastq.gz",
"fastq2_in_path": "/Users/charles/Programs/hostile/sample_R2.fastq.gz",
"fastq2_out_name": "sample_R1.clean_1.fastq.gz",
"fastq2_out_path": "/Users/charles/Programs/hostile/sample_R2.clean_2.fastq.gz",
}
]
After some renaming and comparing:
Total reads in bowtie2.host_unmapped_1.fastq.gz: 251866
Total reads in hostile.host_unmapped_1.fastq.gz: 252328
Unique reads in bowtie2.host_unmapped_1.fastq.gz: 130
Unique reads in hostile.host_unmapped_1.fastq.gz: 592
Common reads: 251736
I took the first read that was unique to each method and blasted them against the nt database. Results:
bowtie2
with GRCh38: the read matched "Homo sapiens 3 BAC RP11-406M8 (Roswell Park Cancer Institute Human BAC Library) complete sequence"hostile
: equally best hit with a few of the same/essentially the same things, e.g. "PREDICTED: Pan paniscus interleukin 1 receptor associated kinase 4 (IRAK4), transcript variant X1, mRNA", "Homo sapiens 12 BAC RP11-210N13 (Roswell Park Cancer Institute Human BAC Library) complete sequence"Thoughts: looks like both hostile
and bowtie2
are retaining some host reads. hostile
is discarding less reads as contamination, without further in-depth searching it's hard to say whether these are extra retained non-host reads or extra erroneously retained host reads.
I do note that the following is said in the Limitations section of the README:
Hostile prioritises retaining microbial sequences above discarding host sequences. If you strive to remove every last human sequence, other approaches may serve you better
But I just thought this might be useful nonetheless.
Cheers
Do you have plan to create the reference genome with virus masked?