ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
214 stars 114 forks source link

Sort ordering and position handling for unmapped mate pairs of mapped reads #224

Open iliat opened 9 years ago

iliat commented 9 years ago

SAM format best practices states: "For a unmapped paired-end or mate-pair read whose mate is mapped, the unmapped read should have RNAME and POS identical to its mate." This leads to a coordinate based read ordering when reading SAM/BAM files when the unmapped mate pari of a mapped read follows right after it even though it technically should have no position information. Some Picard / GATK tools depend on this and assert/fail if this is not the case.

However the current GA4GH API does not specify what should happen in this case and e.g. Google implementation currently has these unmapped reads without position/reference info and returns them after all the mapped reads in the sort order (which causes grief if such data is then pumped into Picard tools).

Note that things get more complicated with multi read fragments.

I am opening this issue to see if we can come to a principled stance on: 1.Should an unmapped mate pair of a mapped read have position and reference data of its mapped mate as per SAM best practice. 2.What should the sort order be for such reads (including multi read fragment case).

My strawman proposal: 1.Do NOT specify position and reference for unmapped reads. 2.Do return unmapped mates of a mapped read right after it in the sort order. Thus the sort order for all reads in a fragment should be: a) All mapped reads in the coordinate order b) All unmapped mates ordered by read length

calbach commented 9 years ago

Perhaps @lh3 or @fnothaft have some ideas here?

calbach commented 9 years ago

Also, I made this claim to Ilia offline:

the current GA4GH API does not specify what should happen in this case

but just found that the behavior is in fact specified in the GA4GH protocol.

record SearchReadsResponse {
  /**
  The list of matching alignment records, sorted by position.
  Unmapped reads, which have no position, are returned last.
  */
  array<org.ga4gh.models.ReadAlignment> alignments = [];

https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/readmethods.avdl#L60

So I guess the question is around whether or not the currently specified behavior is correct.

fnothaft commented 9 years ago

@calbach In ADAM, we implement the same ordering as the GA4GH protocol.

Some Picard / GATK tools depend on this and assert/fail if this is not the case.

Personally, I don't like that design decision. We've encountered a few places where the sort order invariants lead to functionally incorrect behavior in Picard/GATK (e.g., duplicate marking can't process chimeric read pairs and BQSR can't process read pairs with overlapping alignments). This is a longer design argument to have though; with the GATK architecture, I think it would be intractable from a performance perspective to restructure various algorithms so that you can remove that invariant.

ying-w commented 9 years ago

One advantage of unmapped reads being kept next to their mapped pair in bams is for structural variation analysis. These analysis often look for reads where one read is mapped while the other read is either unmapped or mapped to a far-away place. It makes it easier to grab reads together instead of going to the end of the bam every time to search for the unmapped read.

Another possible reason why unmapped reads are kept together is that it could make it easier to randomly fragment a bam file for say parallel processing. Keeping all the unmapped reads at the end means the last chunk of the bam would have the same number of reads but could have very skewed metrics.

Personally, I think for an API unmapped reads should come at the end but there should be some way to easily query for all other reads in a fragment (including other mate pairs, somewhat mentioned here)

skeenan commented 9 years ago

This has been dormant since January. I am closing in 2 days unless objected.

david4096 commented 7 years ago

Related https://github.com/ga4gh/schemas/issues/274