4dn-dcic / pairix

1D/2D indexing and querying on bgzipped text file with a pair of genomic coordinates
MIT License

Two questions about the pairs format #56

Closed lh3 closed 6 years ago

lh3 commented 6 years ago
  1. The example in the spec (https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md#example-pairs-file) gives two strands, one for each of pos1 and pos2. However, I suspect that only one relative strand (= strand1*strand2) is needed. For example, do these two lines make a difference in downstream processing?

    EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 + -
    EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 - +
  2. Another example in the spec shows that only one of the following two lines should be retained:

    EAS139:136:FC706VJ:2:1286:25:275154 chr1 10000 chr2 2000 + +
    EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr1 10000 + +

    which makes sense. However, is it legitimate to encode a triplet with an identical first column, like this:

    EAS139:136:FC706VJ:2:1286:25:275154 chr1 10000 chr2 2000 + +
    EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr3 10000 + +
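
For readers following along, the "only one of the two lines should be retained" rule can be sketched roughly as below. This is a minimal illustration, not pairix's implementation: the helper name `flip_to_upper_triangle` and the chromosome order in `chrom_order` are assumptions made up for this example, and the flip is assumed to swap the strands along with the coordinates.

```python
def flip_to_upper_triangle(fields, chrom_order):
    """Hypothetical helper: put the lower-ranked end of a pair first.

    `fields` is one pairs line split into its 7 columns; `chrom_order`
    maps chromosome name to rank (an assumed order, not taken from the spec).
    """
    read_id, chr1, pos1, chr2, pos2, strand1, strand2 = fields
    key1 = (chrom_order[chr1], int(pos1))
    key2 = (chrom_order[chr2], int(pos2))
    if key1 > key2:
        # Swap both ends, carrying each strand along with its coordinate.
        return (read_id, chr2, pos2, chr1, pos1, strand2, strand1)
    return tuple(fields)


chrom_order = {"chr1": 0, "chr2": 1, "chr3": 2}  # assumed for illustration

a = "EAS139:136:FC706VJ:2:1286:25:275154 chr1 10000 chr2 2000 + +".split()
b = "EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr1 10000 + +".split()
c = "EAS139:136:FC706VJ:2:1286:25:275154 chr2 2000 chr3 10000 + +".split()

# a and b describe the same contact, so they collapse to one line.
same_contact = flip_to_upper_triangle(a, chrom_order) == flip_to_upper_triangle(b, chrom_order)

# a and c share a read id but are different contacts, so both survive.
distinct_contacts = flip_to_upper_triangle(a, chrom_order) != flip_to_upper_triangle(c, chrom_order)
```

Under these assumptions the two triplet lines are not redundant with each other: each is already in flipped order and they name different chromosome pairs.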

Thanks!

SooLee commented 6 years ago

Hi Heng Li,

  1. That's an interesting point, in terms of optimizing space. I think it depends on the downstream process. For example, if you want to map restriction enzyme sites to each of the two mates, the absolute strand of each mate would make a difference, because one would expect the relevant restriction site to be on the 3' side of the read.

  2. We didn't specify the first column as a key, so it's legitimate to have the same read id multiple times. Currently we don't have a formal recommendation about how to encode triplets in a pairs file. Do you have a case where you'd like to add triplets?
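
To make the trade-off in point 1 concrete, here is a small sketch of the relative strand proposed in the question (strand1*strand2). The function name `relative_strand` is made up for this illustration and is not part of pairix or the pairs spec; the lines are split on whitespace for simplicity.

```python
def relative_strand(strand1, strand2):
    """Hypothetical helper: product of the two strand signs (+1 or -1)."""
    sign = {"+": 1, "-": -1}
    return sign[strand1] * sign[strand2]


line_a = "EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 + -".split()
line_b = "EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 - +".split()

# Columns 6 and 7 (indices 5 and 6) hold strand1 and strand2.
rel_a = relative_strand(line_a[5], line_a[6])  # -1
rel_b = relative_strand(line_b[5], line_b[6])  # -1

# The relative strand is identical for both lines...
same_relative = rel_a == rel_b
# ...but the per-mate strands differ, which is what a restriction-site
# lookup for each mate would need.
same_absolute = line_a[5:7] == line_b[5:7]
```

Both lines give a relative strand of -1, so a single relative-strand column could not tell them apart, whereas the two absolute strand columns can.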

Best, Soo


lh3 commented 6 years ago

Thanks, @SooLee!