bede / hostile

Precise host read removal
MIT License

Hostile with no options classifying different than --invert #42

Open jannikseidelQBiC opened 2 months ago

jannikseidelQBiC commented 2 months ago

Hi, and first of all, thanks for the great work.

I ran Hostile twice on the same Illumina paired-end input: once to get the filtered result files and once with --invert to get the removed read pairs. What caught my eye is that the two results do not match: reads_removed in the first run should equal reads_out in the second, and vice versa.

| Mode | reads_removed | reads_out |
| --- | --- | --- |
| no option | 19870638 | 42475288 |
| --invert | 42896358 | 19449568 |
| Difference from the complementary 'no option' count | 421070 | -421070 |
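
As a quick sanity check, the expected relationship between the two runs can be written in a few lines of Python. This is only a sketch: the helper name `complement_gap` is hypothetical and not part of Hostile, and the counts are copied from the numbers reported above. If both runs classified reads identically, both gaps would be zero.

```python
def complement_gap(normal: dict, inverted: dict) -> dict:
    """Compare a default run with an --invert run on the same input.

    Each gap is the --invert count minus the complementary count
    from the default run; identical classification gives 0 for both.
    """
    return {
        "inverted_removed_minus_normal_out":
            inverted["reads_removed"] - normal["reads_out"],
        "inverted_out_minus_normal_removed":
            inverted["reads_out"] - normal["reads_removed"],
    }


# Counts as reported for the two runs (reads_in is identical in both).
normal = {"reads_in": 62345926, "reads_out": 42475288, "reads_removed": 19870638}
inverted = {"reads_in": 62345926, "reads_out": 19449568, "reads_removed": 42896358}

# Each run is internally consistent: reads_out + reads_removed == reads_in.
for run in (normal, inverted):
    assert run["reads_out"] + run["reads_removed"] == run["reads_in"]

gaps = complement_gap(normal, inverted)
print(gaps)
# {'inverted_removed_minus_normal_out': 421070, 'inverted_out_minus_normal_removed': -421070}
```

So each run accounts for every input read, but 421070 reads end up on a different side of the split depending on whether --invert is set.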

The commands I used (installation of Hostile 1.1.0 via conda):

hostile clean --fastq1 <file_forward>.fq.gz --fastq2 <file_reverse>.fq.gz --out-dir filtered_1 > log1_filtered.log
hostile clean --fastq1 <file_forward>.fq.gz --fastq2 <file_reverse>.fq.gz --out-dir removed_1 --invert > log1_removed.log

It seems that running with the --invert flag classifies reads differently than running without it. Am I missing an option that I need to set to get consistent results?

Thanks in advance!

PS: Here are the log files.

[
    {
        "version": "1.1.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "<file_forward>.fq.gz",
        "fastq1_in_path": "<path_to_files>/<file_forward>.fq.gz",
        "fastq1_out_name": "<file_forward>.clean_1.fastq.gz",
        "fastq1_out_path": "filtered_1/<file_forward>.clean_1.fastq.gz",
        "reads_in": 62345926,
        "reads_out": 42475288,
        "reads_removed": 19870638,
        "reads_removed_proportion": 0.31872,
        "fastq2_in_name": "<file_reverse>.fq.gz",
        "fastq2_in_path": "<path_to_files>/<file_reverse>.fq.gz",
        "fastq2_out_name": "<file_reverse>.clean_2.fastq.gz",
        "fastq2_out_path": "filtered_1/<file_reverse>.clean_2.fastq.gz"
    }
]
[
    {
        "version": "1.1.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [
            "invert"
        ],
        "fastq1_in_name": "<file_forward>.fq.gz",
        "fastq1_in_path": "<path_to_files>/<file_forward>.fq.gz",
        "fastq1_out_name": "<file_forward>.clean_1.fastq.gz",
        "fastq1_out_path": "removed_1/<file_forward>.clean_1.fastq.gz",
        "reads_in": 62345926,
        "reads_out": 19449568,
        "reads_removed": 42896358,
        "reads_removed_proportion": 0.68804,
        "fastq2_in_name": "<file_reverse>.fq.gz",
        "fastq2_in_path": "<path_to_files>/<file_reverse>.fq.gz",
        "fastq2_out_name": "<file_reverse>.clean_2.fastq.gz",
        "fastq2_out_path": "removed_1/<file_reverse>.clean_2.fastq.gz"
    }
]
bede commented 2 months ago

Hi Jannik, thank you, this is interesting. From your data there certainly appears to be a problem with how --invert is implemented. By any chance are you able to send me some (or all) of your test data?

Bede

jannikseidelQBiC commented 2 months ago

Hi Bede, unfortunately I cannot share the dataset. Could you try to reproduce the behavior with another dataset? If the problem occurs only with this dataset, that would also be very interesting.

Best, Jannik

bede commented 2 months ago

Thank you – that's understandable. I will investigate using other data.
