dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0
72 stars 41 forks source link

clustmap.py -q 30 option missing from command? #494

Closed zdzbinden closed 2 years ago

zdzbinden commented 2 years ago

I'm constructing an alignment of 100bp single-end ddRAD data to a reference genome. I noticed the SNP positions gleaned from the VCF file could sometimes be >100 and sometimes much greater (1200!).

I dug around the clustmap.py code and found a possible issue.

You note here the use of the -q 30 option should be used to discard low-quality reads at line 1880:

`# (cmd2) samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
    #  -b = write to .bam
    #  -q = Only keep reads with mapq score >= 30 (seems to be pretty standard)
    #  -F = Select all reads that DON'T have these flags.
    #        0x4 (segment unmapped)
    #        0x100 (Secondary alignment) 
    #        0x800 (supplementary alignment) (chimeric-like, not likely)
    #        0x71
    #        0xb1 
    #  -U = Write out all reads that don't pass the -F filter
    #       (all unmapped reads go to this file).`

However that option appears to be missing below at line 1930:

`    # sends unmapped reads to a files and will PIPE mapped reads to cmd3
    cmd2 = [
        ip.bins.samtools, "view",
        "-b",
        "-F", "0x904",
        "-U", ubamout,
        samout,
    ]
`

Does this mean no read quality is being applied?

Small unrelated issue: I while back I ran into an issue with aligning a large number of samples (~1700). The HPC system I use had a default ulimit set to 1024. This was causing ipyrad to stall but without warning or error. It worked fine after adjusted ulimit. I just thought I'd make you aware.

isaacovercast commented 2 years ago

@zdzbinden Thanks for catching that. I added the -q 30. Pretty funny to see the documentation correct, but not the code! Normally it's the other way around. Anyway, thanks for reporting, it'll go out in the next bioconda package.

Thanks also for reporting the ulimit problem with too many open files. I'm not sure there's anything we can really do about this, but it's a good start to document so others might be able to recognize the issue. Thanks a lot!

isaacovercast commented 2 years ago

Will be 0.9.86 once it passes checks in bioconda.

zdzbinden commented 2 years ago

Heads up, I got an error while running 0.9.86 that a string was expected, not an integer. Looking at the code I wonder if: "-q", 30, should be "-q", "30", I changed it on my end and no longer got the error.

isaacovercast commented 2 years ago

Darn it! Thanks for reporting. I fixed this and pushed a new version 0.9.87 which should be up on bioconda shortly.