bede / hostile

Precise host read removal
MIT License
78 stars 4 forks source link

virus masked human genome #30

Closed xiaoli-dong closed 9 months ago

xiaoli-dong commented 11 months ago

Do you have plan to create the reference genome with virus masked?

bede commented 11 months ago

Thanks @xiaoli-dong, I am also curious about accuracy with viral genomes, and I plan to investigate.

My expectation is that the default (human-t2t-hla) reference genome will already perform well for viruses, but I will reply to this issue once I have made some progress. Thanks for reminding me.

bede commented 10 months ago

Hi @xiaoli-dong, My testing shows that a small number of sequences (below 0.00% of viral RefSeq 150mers) are not retained when using the default human-t2t-hla reference. These almost exclusively belong to Herpesvirales.

To provide perfect RefSeq viral retention, I have made a new masked reference that masks 0.18% of the human genome. I will make this available in coming days and update this issue when ready.

bede commented 10 months ago

Apologies for delay, I plan to publish virus masked databases imminently. In the process of building them I made improvements to the masking protocol, which have been committed to the main branch.

As well as viral RefSeq, I will be making available a database masked against the phage genome collection produced by members of Andrew Millard's lab: https://millardlab.org/phage-genomes-jan-2024

bede commented 9 months ago

I have added two human genomes masked against viral and phage genomes (link to readme):

With the latest version of Hostile, you can use these by running e.g.

hostile clean --index human-t2t-hla.rs-viral-202401_ml-phage-202401 --fastq1 reads.fastq.gz

The first time you run this command, the index will be downloaded.

Feedback welcome

charlesfoster commented 2 months ago

Hi @bede,

In case it's of use, I can provide some feedback on the performance of hostile using the human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 reference. My sample comprises paired-end reads where I know the majority of them will be from respiratory syncytial virus A (RSVA), a herpesvirus.

My initial human (host) decontamination was performed with bowtie2 like so:

INDEX=GRCh38
bowtie2 \
    -x $INDEX \
    -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
    --threads 12 \
    --un-conc-gz sample.host_unmapped.fastq.gz \
    --very-sensitive-local \
    1> /dev/null \
    2> >(tee R1399.bowtie2.log >&2)

mv sample.host_unmapped.fastq.1.gz sample.host_unmapped_1.fastq.gz
mv sample.host_unmapped.fastq.2.gz sample.host_unmapped_2.fastq.gz

Results:

255485 reads; of these:
  255485 (100.00%) were paired; of these:
    251866 (98.58%) aligned concordantly 0 times
    1599 (0.63%) aligned concordantly exactly 1 time
    2020 (0.79%) aligned concordantly >1 times
    ----
    251866 pairs aligned concordantly 0 times; of these:
      14 (0.01%) aligned discordantly 1 time
    ----
    251852 pairs aligned 0 times concordantly or discordantly; of these:
      503704 mates make up the pairs; of these:
        503331 (99.93%) aligned 0 times
        170 (0.03%) aligned exactly 1 time
        203 (0.04%) aligned >1 times
1.49% overall alignment rate

I ran hostile like so:

hostile clean --fastq1 R1399_trimmed_1.fastp.fastq.gz --fastq2 R1399_trimmed_2.fastp.fastq.gz --index human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 > log.json

Results:

[
    {
        "version": "1.1.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401",
        "options": [],
        "fastq1_in_name": "sample_R1.fastq.gz",
        "fastq1_in_path": "/Users/charles/Programs/hostile/sample_R1.fastq.gz",
        "fastq1_out_name": "sample_R1.clean_1.fastq.gz",
        "fastq1_out_path": "/Users/charles/Programs/hostile/sample_R1.clean_1.fastq.gz",
        "reads_in": 510970,
        "reads_out": 504656,
        "reads_removed": 6314,
        "reads_removed_proportion": 0.01236,
        "fastq2_in_name": "sample_R2.fastq.gz",
        "fastq2_in_path": "/Users/charles/Programs/hostile/sample_R2.fastq.gz",
        "fastq2_out_name": "sample_R1.clean_1.fastq.gz",
        "fastq2_out_path": "/Users/charles/Programs/hostile/sample_R2.clean_2.fastq.gz",
    }
]

After some renaming and comparing:

Total reads in bowtie2.host_unmapped_1.fastq.gz: 251866
Total reads in hostile.host_unmapped_1.fastq.gz: 252328
Unique reads in bowtie2.host_unmapped_1.fastq.gz: 130
Unique reads in hostile.host_unmapped_1.fastq.gz: 592
Common reads: 251736

I took the first read that was unique to each method and blasted them against the nt database. Results:

  1. Manual bowtie2 with GRCh38: the read matched "Homo sapiens 3 BAC RP11-406M8 (Roswell Park Cancer Institute Human BAC Library) complete sequence"
  2. hostile: equally best hit with a few of the same/essentially the same things, e.g. "PREDICTED: Pan paniscus interleukin 1 receptor associated kinase 4 (IRAK4), transcript variant X1, mRNA", "Homo sapiens 12 BAC RP11-210N13 (Roswell Park Cancer Institute Human BAC Library) complete sequence"

Thoughts: looks like both hostile and bowtie2 are retaining some host reads. hostile is discarding less reads as contamination, without further in-depth searching it's hard to say whether these are extra retained non-host reads or extra erroneously retained host reads.

I do note that the following is said in the Limitations section of the README:

Hostile prioritises retaining microbial sequences above discarding host sequences. If you strive to remove every last human sequence, other approaches may serve you better

But I just thought this might be useful nonetheless.

Cheers