add CIGAR operators for splices

saupchurch commented 9 years ago

To support explicit splice declaration it is proposed that CIGAR operators be added to support canonical, non-canonical, major, minor and potentially other types of splices. This issue is to discuss what operators may be needed as well as their implementation.

diekhans commented 9 years ago

With ~99% of splice sites being GT-AG, this may be the only case that needs an optimized representation in the CIGAR. All other cases could be stored as the actual 4 bases with minimal real overhead.

An estimation of cost from real data would be valuable.
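
A rough sketch of the encoding suggested above, assuming a hypothetical `encode_splice_motif` helper: the canonical GT-AG pair costs nothing, and any other pair is stored as its four bases. The function names are made up for illustration, and any counts fed to `extra_bytes` would have to come from real alignments to give a meaningful estimate.

```python
from typing import Iterable, Optional, Tuple

CANONICAL = ("GT", "AG")  # donor, acceptor dinucleotides on the transcribed strand


def encode_splice_motif(donor: str, acceptor: str) -> Optional[str]:
    """Return None for the canonical GT-AG pair, else the explicit 4 bases."""
    donor, acceptor = donor.upper(), acceptor.upper()
    if len(donor) != 2 or len(acceptor) != 2:
        raise ValueError("donor and acceptor must each be 2 bases")
    if (donor, acceptor) == CANONICAL:
        return None
    return donor + acceptor


def extra_bytes(junctions: Iterable[Tuple[str, str]]) -> int:
    """Total bytes spent on explicit motifs (4 per non-GT-AG junction)."""
    total = 0
    for donor, acceptor in junctions:
        encoded = encode_splice_motif(donor, acceptor)
        if encoded is not None:
            total += len(encoded)
    return total
```

With this scheme the overhead grows only with the number of non-GT-AG junctions, which is the quantity a real-data estimate would need to measure.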

lh3 commented 9 years ago

One option is to replace the SKIP operation (aka N in SAM) with SPLICE. So far, the N operator has been used only for splicing and nothing else. We could reuse CigarUnit::referenceSequence to keep the donor/acceptor pair, or add union {null,string} donor, acceptor to CigarUnit explicitly.
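
To make the second option concrete, here is a minimal Python sketch of a CIGAR unit with optional `donor` and `acceptor` fields. It only mirrors the shape of the idea: the Python class, the subset of operations shown, and the snake_case field names are assumptions for illustration, not the actual schema definition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CigarOperation(Enum):
    # Only a subset of operations, for illustration.
    ALIGNMENT_MATCH = "M"
    INSERT = "I"
    DELETE = "D"
    SPLICE = "N"  # the proposed new name for SKIP


@dataclass
class CigarUnit:
    operation: CigarOperation
    operation_length: int
    reference_sequence: Optional[str] = None
    donor: Optional[str] = None     # union {null, string} in the proposal
    acceptor: Optional[str] = None  # union {null, string} in the proposal

    def __post_init__(self) -> None:
        # Assumed rule: splice signals are only meaningful on SPLICE units.
        is_splice = self.operation is CigarOperation.SPLICE
        if not is_splice and (self.donor or self.acceptor):
            raise ValueError("donor/acceptor only make sense on SPLICE units")


unit = CigarUnit(CigarOperation.SPLICE, 1200, donor="GT", acceptor="AG")
```

Both fields default to null, so a server that does not track splice signals could leave them unset.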

richarddurbin commented 9 years ago

I don't understand why the two flanking intron bases need to be stored in the CIGAR string. Surely they can be inferred from the reference that is being aligned to? I support redefining N to be SPLICE for backward compatibility, given that this has been the use of the N operator so far. Why replace for the sake of it?

Richard

On 15 Dec 2014, at 20:22, Heng Li notifications@github.com wrote:

One option is to replace the SKIP operation (aka N) with SPLICE. So far, the N operator has only been used for splicing but nothing else. We could reuse CigarUnit::referenceSequence to keep the donor/acceptor pair, or add union {null,string} donor, acceptor to CigarUnit explicitly.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

lh3 commented 9 years ago

I don't understand why the two flanking intron bases need to be stored in the CIGAR string. Surely they can be inferred from the reference that is being aligned to?

Retrieving the donor/acceptor strings from the reference genome may be inefficient.

I support redefining N to be SPLICE for backward compatibility, given that this has been the use of the N operator so far. Why replace for the sake of it?

In the schema, we only have SKIP. It is not obvious that it in fact refers to splicing. The proposal is to rename SKIP to SPLICE.

richarddurbin commented 9 years ago

I don't understand why the two flanking intron bases need to be stored in the CIGAR string. Surely they can be inferred from the reference that is being aligned to?

Retrieving the donor/acceptor strings from the reference genome may be inefficient.

If the argument is efficiency, then this should not be resolved by explicit encoding in CIGAR. Instead it should be a call on the alignment (or set of alignments), and the interface/server can decide whether to implement it by storing explicitly in its internal representation, or calculating on the reference. It does not seem that hard to me to extract from the reference. In any case, how often will users actually want to access these bases - that seems more of a QC thing than a primary thing.

Basically it seems to me a clear abuse of the CIGAR data, which represents the alignment, to try to put additional non-alignment data into it. I think it should be stored somewhere else, or accessed in some other way, if you want it.
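
As an illustration of extracting the bases from the reference, the following sketch walks a SAM-style CIGAR string and reads the first and last two intron bases for each N operation. It assumes the reference is available as an in-memory string and that `start` is 0-based; the function name `splice_motifs` is hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance the reference position


def splice_motifs(reference: str, start: int, cigar: str):
    """Yield (intron_start, intron_end, first2, last2) for each N operation.

    `start` is the 0-based reference position of the first aligned base and
    intron_end is exclusive. Bases are read off the forward reference strand,
    so a minus-strand GT-AG intron shows up here as CT..AC.
    """
    pos = start
    for length_text, op in CIGAR_RE.findall(cigar):
        length = int(length_text)
        if op == "N":
            intron = reference[pos:pos + length]
            yield pos, pos + length, intron[:2], intron[-2:]
        if op in REF_CONSUMING:
            pos += length


reference = "AAAAA" + "GT" + "CCCCCC" + "AG" + "TTTTT"
junctions = list(splice_motifs(reference, 0, "5M10N5M"))
```

The work per junction is two short slices of the reference, so whether this is inefficient depends mainly on how the server stores and accesses the reference sequence.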

I support redefining N to be SPLICE for backward compatibility, given that this has been the use of the N operator so far. Why replace for the sake of it?

In the schema, we only have SKIP. It is not obvious that it in fact refers to splicing. The proposal is to rename SKIP to SPLICE.

Yes, that is what I meant. I support renaming it.

delagoya commented 9 years ago

I agree with Richard and support redefining N as SPLICE. Any interpretation of it should not be encoded in the CIGAR string, but either handled in the annotations or derived from the reference.

lh3 commented 9 years ago

The schema defines requests, not storage. Suppose we want to know whether a splice site is canonical: how would we request that, and in what format should the server respond? The simplest and most convenient way is to mix the splicing signal with CIGAR.

Also note that there is already a referenceSequence field in CigarUnit. I added it as a response to support the MD tag (there was an issue on this topic). If we don't want to keep splicing in CigarUnit, we should also remove this field as it violates the same principle.

richarddurbin commented 9 years ago

What is the motivation for this request?

I see it as parallel to, for example, wanting to know the flanking sequence for a TRADIS (transposon insertion sequence) read alignment.

On 17 Dec 2014, at 14:54, Heng Li notifications@github.com wrote:

The schema defines requests, not storage. Suppose we want to know whether a splice site is canonical, how could we request and in what format should the server respond? The simplest and the most convenient way is to mix the splicing signal with CIGAR.

Also note that there is already a referenceSequence field in CigarUnit. I added it as a response to support the MD tag (there was an issue on this topic). If we don't want to keep splicing in CigarUnit, we should also remove this field as it violates the same principle.

saupchurch commented 9 years ago

It looks like there are two parts of this issue that have been identified:

1) Rename the CIGAR Skip to Splice (or something similar). This would be done to more properly document how Skip is being used in practice.

2) Categorization of splice sites. I agree that this does not properly belong in the CIGAR, which is focused on alignment information. It is desirable to be able to retrieve the set of reads that have splices, as well as to know the types of those junctions. This seems to be a task for an API query method rather than the CIGAR.

How do we go about implementing (1)? Should discussion of (2) continue here, or would it be better to open a new issue for it separate from CIGAR discussion?
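
For part (2), a query method could look roughly like the sketch below. It is only an assumption about the shape of such a call: `reads_with_splices` and its `canonical` parameter are hypothetical names, not part of any existing API, and junctions are classified only as GT-AG or not, with the intron bases given on the forward reference strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_canonical(first2: str, last2: str) -> bool:
    """True if the intron is GT-AG on either strand.

    first2/last2 are the first and last two intron bases on the forward
    reference strand; a minus-strand GT-AG intron reads CT..AC there.
    """
    forward = (first2.upper(), last2.upper())
    reverse = (reverse_complement(last2), reverse_complement(first2))
    return ("GT", "AG") in (forward, reverse)


def reads_with_splices(junctions_by_read, canonical=None):
    """Return names of spliced reads, optionally filtered by junction type.

    junctions_by_read maps a read name to a list of (first2, last2) pairs.
    canonical=None keeps every spliced read; True or False keeps reads with
    at least one junction of that kind.
    """
    selected = []
    for name, junctions in junctions_by_read.items():
        flags = [is_canonical(first2, last2) for first2, last2 in junctions]
        if not flags:
            continue  # unspliced read
        if canonical is None or canonical in flags:
            selected.append(name)
    return selected


reads = {"r1": [("GT", "AG")], "r2": [("CT", "AC")], "r3": [("AT", "AC")], "r4": []}
canonical_reads = reads_with_splices(reads, canonical=True)
```

A finer categorization (major, minor, and so on) would need more than the two flanking dinucleotides, so it is left out of this sketch.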

skeenan commented 9 years ago

This has been dormant since January. Could we have a comment on whether this issue has been resolved? Closing in 2 days unless there are objections.

ga4gh / ga4gh-schemas

add CIGAR operators for splices #211