Randomize reads before quantification with Salmon

jaclyn-taroni commented 6 years ago

Context

Salmon assumes that the reads you give it are in a random order -- see the first Note here in the Salmon docs.

Problem or idea

Because we won't know if the fastq.gz files from SRA have been ordered in some way without checking, we might want to shuffle the reads before giving it to salmon quant. It is not clear to me how frequently these files will be in a non-random order (if ever).

Solution or next step

Based on my understanding of #157 (and by extension, how things are currently implemented), we have to unzip the fastq.gz file to determine the read lengths. That might be the time to shuffle the reads, but this needs a dev team perspective. Also, as noted above, I don't know how many samples this will effect.

cgreene commented 6 years ago

I would rather ask upstream if they can implement a flag that would randomize first at a small cost of compute.

rob-p commented 6 years ago

It seems a reasonable assumption that this should almost never occur in the wild. I am specifically couching my language here, because, as we all know, if something can happen in genomics, it will. However, the raw reads are expected to come off the sequencer in a sufficiently random order. The only real reason you might see ordered reads (that I can easily thing of), is if someone “back-converts” a sorted BAM to FASTQ files or some such. Hopefully, this is infrequent enough to ignore, at least in a first pass.

jaclyn-taroni commented 6 years ago

Alright cool, my suspicion was “almost never” :+1: I will close this.

durbrow commented 5 years ago

FYI As it stands now, for SRA:

For aligned submissions (i.e. SAM/BAM/CRAM), the order is the aligned position. This is true even if you are extracting FASTQ.
Otherwise, the order is mainly whatever the submitter sent.

The second factor is that SRA stores mate-pairs together, so there can be some reordering to make that happen. Also some of the tools for reading SRA have options that can effect the output order.

There is a general need in bio-informatics for a tool to sort reads in random order. Unfortunately, neither FASTQ nor SAM is amenable to simply piping into sort -R.

cgreene commented 5 years ago

@durbrow : do you store (and it it made accessible) whether the submission was an aligned one or not?

cgreene commented 5 years ago

@durbrow : also, do you know what fraction of submissions of RNA-seq samples are aligned vs. what the submitter sent?

durbrow commented 5 years ago

@cgreene wrt RNA-seq, I'm sorry, I don't know that. I deal mostly with the technical side of things (I am one of the SRA devs). From the technical side, there is very little special that is done for RNA-seq in SRA. For example, we don't record in the file whether or not it is RNA-seq. That information is probably available from the metadata for the sample. I can put you in touch with our data curators? They may know, or be able to run a query to get the answer.

cgreene commented 5 years ago

@durbrow : if there are folks there who could let us know how frequently this occurs and/or if there is a way to know that it has occurred so that we can run something slightly different in those cases, it would be wonderful to be put in touch with them. Thank you!

durbrow commented 5 years ago

@cgreene wrt is aligned vs unaligned. Yes we do. We have tools that do provide that information. Our vdb-dump tool can show you what the underlying database is, the tables it contains, the columns, data types, etc, as well as the data. In particular, all aligned SRA objects have a PRIMARY_ALIGNMENT table, you can probably guess what it contains. No unaligned SRA objects contain that table. For example: vdb-dump --table_enum SRR5054227 would give you a list of the tables in that run, one of them is PRIMARY_ALIGNMENT

We generally don't talk about these details. Our tools try to present a uniform interface to the data so that our users don't have to worry about it. So you can get FASTQ from an aligned submission, or SAM from an unaligned one.

durbrow commented 5 years ago

@cgreene You should be able to contact the data curation team here sra@ncbi.nlm.nih.gov

durbrow commented 5 years ago

BTW, in my opinion, this is the best way to get info about SRA objects. If you have a run accession, you can navigate the SRA hierarchy, find sample and project metadata, find other submissions associated with it, the articles they were published in, etc.

https://www.ncbi.nlm.nih.gov/sra/?term=SRR5054227

Change term=X as needed for different accessions

dampierch commented 5 years ago

I use a similar (but way less sophisticated) workflow that feeds SRA-derived FASTQ into Salmon for quantification. I've been trying to ignore the read order issue, but @durbrow's response makes it harder to do that, especially with this HowTo wiki encouraging sorted BAMs. (I have never submitted samples to SRA, so I haven't actually used that workflow.)

FWIW, this small-scale test I ran on the 39 tumor:adjacent normal pairs (78 observations each under sorted and shuffled conditions for a total of what should be 156 marks on the PCA) in the TCGA COAD project makes me think the impact of inadvertently using sorted input is not dramatic:

For TCGA single-end samples, I shuffled the reads with this one-liner from Aaron Quinlan, et al:

$awk -v OFS='\t' '{getline seq; getline sep; getline qual; print $0,seq,sep,qual}' ${name}.fastq | shuf | awk -v OFS='\n' '{print $1,$2,$3,$4}' > ${name}.sh.fastq

I used this Perl script from the chloroExtractorTeam to shuffle the paired-end reads.

I haven't implemented shuffling yet for SRA, but I'm starting to think I should. Judging from your repo, I think you will probably come up with something better than I will.

cgreene commented 5 years ago

@dampierch : I'm curious - if you shuffled the reads twice, what does the run-to-run variability look like? Does the sorted version look just like an arbitrary ordering of reads, or is there something systematic going on there that makes the difference larger?

dampierch commented 5 years ago

@cgreene: I shuffled each tissue sample's set of reads one time, but the paired nature of my dataset (tumor and tumor-adjacent "normal" tissue) resulted in two sets of reads for each subject; hence all the data points on my plot despite just 39 pairs in my comparison. I think I make this more confusing every time I write it. (My PI is looking forward to the end of this project.)

I am probably misunderstanding your question, but my (i.e. TCGA's) sorted reads are coordinate-sorted. I do not understand the EM algorithm enough to judge whether a potential bias on parameter estimation arising from processing a series of reads from (the beginning of) chromosome 1, then from chromosome 2, etc. is likely to have an effect on abundance estimates dramatic enough to have meaningful biological consequence. I'm skeptical, but as I'm ignorant, I'm also fearful. Do you have a view on this (i.e. parameter estimation and biological interpretation), @rob-p?

dampierch commented 5 years ago

@rob-p, sorry for bringing this up again. i see you addressed it in 2015, 2016 (indirectly), and 2017. if you include @jaclyn-taroni's original opening of the present issue, you have addressed the problem every year for the past four. you now have the opportunity to keep your streak alive.

@cgreene, in case you have not already seen this sort-of-related old tweet, i bring it to your attention as a counterpoint to my PCA plot. I think FPKMs are not the right unit, and my relative expression summary is more relevant than an absolute value comparison, not to mention they were using eXpress, but I'm still impressed with the divergence in estimates driven by read order (if that's really what was going on). I'd like to do a more formal comparison of coordinate-sorted vs shuffled input.

rob-p commented 5 years ago

Hi @dampierch, you certainly make a good argument for me to keep my streak alive :). There are, I think, a few points to address. The first is restricted purely to "alignment-based" salmon --- where the quantification is done directly from the aligned BAM file, while the second is the effect of observing the reads (or the corresponding alignments) in a given order. So, I'll address the potential issues in decreasing order of severity (i.e. how much they would bork the analysis):

In alignment-based mode, using a coordinate sorted file for paired-end reads is completely wrong and will lead to highly inaccurate results. This is because salmon makes basic assumptions, like RSEM does, that (a) all of the alignments for a given read will be adjacent in the BAM file and (b) records for the alignments of the ends of a paired end read will always be consecutive in the BAM file so that if, for example, read 1 multi maps to transcripts A,B and C, the alignments would look like:

<alignment of 1 (left end) to A>
<alignment of 1 (right end) to A>
<alignment of 1 (left end) to B>
<alignment of 1 (right end) to B>
<alignment of 1 (left end) to C>
<alignment of 1 (right end) to C>

if the alignments are not properly grouped like this, salmon is likely to write a bunch of verbose warnings to the log (i.e. both to cerr and log file) complaining about suspicious pairs.

In either alignment-based or mapping-based mode, what is the effect of observing the reads in a specific "sorted" order versus a random order? Here, the answer is more subtle, and the magnitude of the effect generally less extreme. Salmon has both an "online" phase and an "offline" phase. During the "online" phase, salmon performs estimation of transcript abundance, but also estimates "auxiliary" parameters of the experiment --- things like the fragment length distribution, the sequencing error model (in alignment-based mode), and sequence-specific and fragment-GC bias if these are enabled. Additionally, it also places the mapping information into (often range-factorized) equivalence classes for later "offline" processing. The read order only has a potentially adverse effect during this online phase. The reason it can have an adverse effect is because the online estimation algorithm assumes that observations (i.e. reads and their corresponding mappings) are arriving in a random order. Intuitively, this is because the online algorithm updates the estimates of the abundance of each transcript after every batch of reads it observes. If the reads are arriving in a random order, a given batch is a (very noisy) but unbiased sample from the true distribution. However, if the reads arrive in a given order (say all the reads from transcript A are observed first), then the algorithm will learn a bias in the abundances early on because the rate at which reads arrive that map to A will be very high early on in the stream of observations, quickly leading the algorithm to infer that A has very high abundance. So, the non-random read order affects the online estimates of the transcript abundances, but for the same reason, it also affects the estimates of the auxiliary parameters, since they typically require weighting each update based on the probability of fragments arising from each of the transcripts to which they match ... which depends on the online estimates of transcript abundance.

Now, the good news. Once all reads have been observed, salmon begins the offline phase where it uses the (range-factorized) equivalence classes to iteratively re-estimate transcript abundance until those estimates converge. During the offline phase, the order in which the reads were observed has no specific effect, because what is being processed is a reduced representation of the mapping information. The effect of the read order on this phase happens only in two ways (1) the initialization of the offline VBEM algorithm is given by a combination of the online estimates of transcript abundance and the uniform (i.e. uninformative) abundance vector (2) the weights in the (range-factorized) equivalence classes are estimated under the auxiliary parameters learned during the online phase. However, the fact that this offline phase is always run, typically for hundreds or thousands of iterations, until convergence, means that the magnitude of the effect of mis-estimating transcript abundances by observing the reads in order during the online phase is greatly mitigated. That means that different read orders don't tend to lead to total failures of estimation as you observe in the tweet above (though, in that case, it's not clear that salmon would be happy anyway because the alignment records in the coordinate sorted BAM are not contiguous and the pairs are not adjacent). However, the exact magnitude of the effect is hard to analyze theoretically, and depends on the extent to which the auxiliary models are affected as well.

dampierch commented 5 years ago

thank you, @rob-p, for your characteristically thorough and helpful response. may this constitute your last exposition on the read order issue for 2019.

alanhoyle commented 4 years ago

Another year, another question about this: would it be sufficient to take my STAR aligned BAMs and run them through samtools collate?

samtools collate possibly_sorted_by_alignment.bam grouped_by_read_name which should save grouped_by_read_name.bam

dampierch commented 4 years ago

Hi @alanhoyle , I think the samtools collate approach should work, although I haven't tried it. One issue that could give you trouble is that the QNAME field may not always be available in the BAM file of interest. SRA often stores read names but does not guarantee read name retention. See this issue comment for a brief introduction to the read name problem in SRA.

AlexsLemonade / refinebio