10XGenomics / vartrix

Single-Cell Genotyping Tool
MIT License

Option for no Smith-Waterman when only looking at SNPs, and speed issues with many 0-depth loci #28

Open wheaton5 opened 4 years ago

wheaton5 commented 4 years ago

Hey Ian,

This is just a thought. I realize that the Smith-Waterman alignment to both haplotypes helps sort out alignment edge effects with indels, but does it do anything for SNPs? A nearby indel can still cause problems, but I don't know whether vartrix uses nearby variants, and in any case, if I feed it a VCF with only SNPs, it won't know about nearby indels. I wonder how much faster a purely pileup-based matching strategy would be as a non-default option?

Also, I was running vartrix on 1000 Genomes variants filtered to >=2% allele frequency and it was taking quite a while (I think I killed it after 6 hours, and I was using 8 cores). So what I am doing now is using a combination of samtools depth, awk, and bedtools to filter to regions with coverage above some threshold (for my purposes, cov >= 8 or so), which drops me from tens of millions of variants down to ~120k, and then it runs in a very reasonable amount of time (see the sketch below). This works for me, but what do you think the issue is? I guess BAM fetches take time even if they retrieve zero or only a few reads. Even if you just run samtools depth plus some filtering steps under the hood, it might be nice to include this as an option (e.g. --min-coverage).
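For reference, a minimal sketch of that kind of depth-filtering pipeline; the file names and the threshold of 8 are placeholders, and the BAM is assumed to be coordinate-sorted:

# emit positions with depth >= 8 as BED intervals (samtools depth output is 1-based)
samtools depth possorted_genome_bam.bam \
  | awk -v MIN=8 'BEGIN{OFS="\t"} $3 >= MIN {print $1, $2 - 1, $2}' > covered.bed
# collapse adjacent positions into contiguous intervals
bedtools merge -i covered.bed > covered_merged.bed
# keep only variants that fall inside a covered interval, preserving the VCF header
bedtools intersect -header -u -a variants.vcf -b covered_merged.bed > variants.filtered.vcf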

Anyway, just some potential improvement suggestions.

Best, Haynes

ifiddes-10x-zz commented 4 years ago

Hi Haynes,

Thanks for the suggestions. The main reason I went with the full SW alignment is that in addition to the indels, STAR introduces all sorts of weird alignments, as you saw in your attempts to get good variant calls. I was also inspired by the way LongRanger works to assign barcodes to haplotypes.

A pure pileup version would definitely be faster and could be a good option. There is already CIGAR parsing to check whether an alignment is an N-op (refskip) over the region of interest; that could be extended.

I suspect that your runtime concerns are BAM-fetch related, but also slightly more complicated -- the reason I filter on the N-op issue mentioned above is that STAR likes to make crazy huge splices that span hundreds of kb, and BAM fetch will still pull those in. It may be possible to use the pileup API to bypass this.
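As a rough command-line illustration of the pileup idea (using samtools rather than the rust-htslib API that vartrix uses internally; all file names are placeholders), base counts can be pulled directly at the SNP positions:

# extract chrom<TAB>pos from the SNP-only VCF
grep -v '^#' snps.vcf | awk 'BEGIN{OFS="\t"} {print $1, $2}' > positions.tsv
# pileup restricted to those positions; column 5 holds the observed bases per locus
samtools mpileup -l positions.tsv -f ref.fa possorted_genome_bam.bam | head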

arogozhnikov commented 4 years ago

My issue probably has the same cause:

vartrix and cellSNP take too long (demuxlet runs 10x faster, even though allele counting is only part of its pipeline). It also seems to me that they get stuck on the Y chromosome (while the data contains only X), so this is probably connected to this issue.
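One quick way to check for that (an illustrative command, assuming an indexed BAM; the file name is a placeholder) is to list contigs with no mapped reads at all; variants on any contig printed here are all-zero-depth loci that the tool still has to visit:

# column 3 of idxstats is the mapped-read count per contig
samtools idxstats possorted_genome_bam.bam | awk '$3 == 0 {print $1}'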

pmarks commented 4 years ago

@wheaton5 could you send me the VCF you're using? I'll do some profiling to understand why it's slow. You can send it to me via redstone: https://support.10xgenomics.com/docs/transfers

wheaton5 commented 4 years ago

@pmarks I have them on my Google Drive and you should be able to wget them with the following commands:

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=15s8zvIit2UO-2lnL2DnsL0YFoR3AWWRF' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=15s8zvIit2UO-2lnL2DnsL0YFoR3AWWRF" -O filtered_2p_1kgenomes_GRCh38.vcf && rm -rf /tmp/cookies.txt

for GRCh38 and

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ICfIhpA4iGPEz_lAZf6RLMFQlrfgaskL' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ICfIhpA4iGPEz_lAZf6RLMFQlrfgaskL" -O filtered_2p_1kgenomes_hg19.vcf && rm -rf /tmp/cookies.txt

for hg19

pmarks commented 4 years ago

@wheaton5 what aligner created the BAM file that you're using?

I profiled the VCF you sent me against a 10k PBMC BAM file from Cell Ranger (so STAR alignments); a screenshot of the profile is attached.

So my guess is that it's spending a lot of time loading the crazy STAR records with massive refskip blocks that don't have any bases actually aligned to the target region. In your VCF there's a SNP every ~300bp, so we're probably reloading those long records many times. We could test this idea by filtering out those records (I think they're a small fraction of the total, right?); see the sketch below.
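One hypothetical way to run that test (this is not vartrix functionality; the 100 kb cutoff and the file names are placeholders) is to drop alignments whose CIGAR contains a very long refskip and re-run on the filtered BAM:

# keep header lines; for each read, find its longest N (refskip) op
# and drop the read if that op exceeds 100kb
samtools view -h possorted_genome_bam.bam \
  | awk '/^@/ {print; next}
         { longest = 0; c = $6
           while (match(c, /[0-9]+N/)) {
             len = substr(c, RSTART, RLENGTH - 1) + 0
             if (len > longest) longest = len
             c = substr(c, RSTART + RLENGTH)
           }
           if (longest <= 100000) print }' \
  | samtools view -b -o filtered.bam -
samtools index filtered.bam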

So I think our options are:

a) Get rid of the weird records. Probably a longer-term project for Cell Ranger; it could be a little preprocessing tool that either drops those records or trims off the most egregious examples.
b) Add a variant filtering step like the one you implemented, maybe with configurable depth.
c) Process blocks of nearby variants together, using one fetch to get all the reads needed.

Any thoughts?


wheaton5 commented 4 years ago

@pmarks

I am also testing against a STAR-aligned BAM, specifically the one here:

wget https://sra-pub-src-1.s3.amazonaws.com/SRR5398235/A.merged.bam.1 -O A.merged.bam
wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2560nnn/GSM2560245/suppl/GSM2560245_barcodes.tsv.gz
gunzip GSM2560245_barcodes.tsv.gz

I think that is a good plan, but how would you do this efficiently? It seems like you would need to make a new BAM file. What I do is filter the VCF to loci with depth > min_depth (default 8) using samtools depth + awk + bedtools, as in the sketch above (there might be faster ways, but this is reasonably fast and low memory).

I would love it if 10x moved cellranger to hisat2, but I realize that is unlikely. Its alignments are much more conducive to variant calling.

wheaton5 commented 4 years ago

@pmarks

Another issue related to this.

When vartrix realigns across a locus that hisat2 or STAR has called as a splice site, it can make big mistakes, because I assume vartrix's Smith-Waterman is not splice-aware. For this I would like either a --ignore_spliced_reads or --no_realignment option, or an --ignore_reads_in_splice_regions option, where the last one would still evaluate a read at a locus where that read's alignment is not currently an N-op, but would ignore the read if its current alignment has an N at the base of interest.
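For reference, a rough command-line approximation of that last check (the position and file names are placeholders, and this only tests whether a read's CIGAR contains a refskip anywhere; a real implementation would walk the CIGAR to see whether the N actually covers the base of interest):

# list reads overlapping a SNP whose CIGAR contains any refskip (N) op
# (assumes an indexed BAM)
samtools view possorted_genome_bam.bam chr1:12345-12345 \
  | awk '$6 ~ /N/ {print $1, $4, $6}'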