Feature Request: support for additional contig/chromosome aliases

bpow commented 10 years ago

Short use case:

The convention at my institution is to use the individual chromosome's accession (like NC_000003.11) as the sequence identifier in BAM files. But I would like to view these BAM files alongside tracks from other sources that use the conventional chr3- or 3-style names.

Discussion

The requested functionality would be something like the alias file functionality for IGV, as described at https://www.broadinstitute.org/software/igv/LoadData/#aliasfile

Dalliance already supports recognition of the chr3 and 3 names in BAM files by filling out the chrToIndex map in makeBam; this feature would extend that to permit additional chromosome name synonyms. I'm happy to code this if you think it would be a useful, generalizable feature.

I would imagine that the alias names could go in the coordSystem object, as a list of lists of names that should be considered synonyms. This might look like:

coordSystem: {
  speciesName: 'human',
  sequenceAliases: [
    ['NC_000001.10', 'chr1', '1'],
    ['NC_000002.11', 'chr2', '2'],
    ['NC_000003.11', 'chr3', '3'],
    ...
  ],
  ...
}
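
To make the proposal concrete, here is a minimal sketch of how a list-of-lists like this could be turned into a lookup table. It is not Dalliance code: `buildAliasIndex` and `areSynonyms` are hypothetical helper names, and the sketch assumes each sequence name belongs to at most one alias group.

```javascript
// Hypothetical helpers, not part of Dalliance: index a sequenceAliases
// list-of-lists so that any name can be resolved to its synonym group.
function buildAliasIndex(sequenceAliases) {
  const index = new Map();
  for (const group of sequenceAliases) {
    for (const name of group) {
      // Assumption: a name may appear in only one group.
      if (index.has(name) && index.get(name) !== group) {
        throw new Error('Sequence name in more than one alias group: ' + name);
      }
      index.set(name, group);
    }
  }
  return index;
}

// True if the two names are identical or listed in the same alias group.
function areSynonyms(index, a, b) {
  if (a === b) {
    return true;
  }
  const group = index.get(a);
  return group !== undefined && group.includes(b);
}

const aliasIndex = buildAliasIndex([
  ['NC_000001.10', 'chr1', '1'],
  ['NC_000002.11', 'chr2', '2'],
  ['NC_000003.11', 'chr3', '3']
]);

console.log(areSynonyms(aliasIndex, 'NC_000003.11', 'chr3'));
console.log(areSynonyms(aliasIndex, 'chr3', '2'));
```

A lookup of this kind would let a data source translate whatever name it uses internally into the name the browser is currently displaying, which is the same job the chrToIndex map does today for the chr3 and 3 forms.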

dasmoth commented 10 years ago

There is actually a way this can be done already. Dalliance has supported loading data via coordinate-system mappings for a while now. Historically this has needed alignment-DAS servers, which I admit are a bit of a hassle. But as of Dalliance 0.12, mapping data can also be loaded from slightly-specialized bigBed files with columns as follows:

 string chrom;       "Reference sequence chromosome or scaffold"
 uint   chromStart;  "Start position in chromosome"
 uint   chromEnd;    "End position in chromosome"
 string ori;         "Orientation on dest sequence"
 string srcChrom;    "Source chromosome or scaffold name"
 uint   srcStart;    "Source start position"
 uint   srcEnd;      "Source end position"
 string srcOri;      "Orientation on source sequence"
 int    blockCount;  "Number of blocks in alignment"
 uint[blockCount] srcStarts;    "Offsets of block starts within source region"
 uint[blockCount] destStarts;   "Offsets of block starts within dest region"
 uint[blockCount] blockLens;  "Block lengths"

There's no reason why you can't use this scheme to define one-to-one mappings between two sequence IDs, e.g.:

  chr1 0 249250621 + NC_000001.10 0 249250621 + 1 0 0 249250621

Make a file with a line like this for each chromosome, run it through bedToBigBed, then configure your sources to use this mapping.
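
As a rough illustration of that step, the sketch below generates one identity-mapping line per chromosome in the column order listed above. `identityMappingLine` is a hypothetical helper, not part of Dalliance; the space separator simply mirrors the example line, and chromosome lengths have to be supplied by the caller (only the chr1 value comes from this thread).

```javascript
// Hypothetical helper: build a one-block, one-to-one mapping line between
// two names for the same sequence. Column order follows the listing above:
// chrom chromStart chromEnd ori srcChrom srcStart srcEnd srcOri
// blockCount srcStarts destStarts blockLens
function identityMappingLine(destName, srcName, length) {
  if (!Number.isInteger(length) || length <= 0) {
    throw new RangeError('length must be a positive integer');
  }
  return [
    destName, 0, length, '+',
    srcName, 0, length, '+',
    1,       // a single block covering the whole sequence
    0,       // srcStarts
    0,       // destStarts
    length   // blockLens
  ].join(' ');
}

// The chr1 length is the one used in the example line above.
const line = identityMappingLine('chr1', 'NC_000001.10', 249250621);
console.log(line);
// prints: chr1 0 249250621 + NC_000001.10 0 249250621 + 1 0 0 249250621
```

Calling this once per chromosome and writing the results to a file gives the input that is then converted with bedToBigBed.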

That said, I agree this is a little cumbersome, and we'd be very interested in a contribution that added support for your sequenceAliases tables, IGV-style alias files, or possibly both. My suggestion would be to re-use the existing coordinate-mapping functionality for this purpose, and treat the alias file as another possible source of mapping data. If you look in "chainset.js", there is a ChainFetcher interface, which an alternative mapping data source could implement.

dasmoth commented 10 years ago

The following configuration now works in the latest git master:

            {
              name: "alias-test",
              uri: "http://www.biodalliance.org/datasets/tests/test-alias.bed",
              tier_type: "memstore",
              payload: "bed",
              sequenceAliases: [['twentytwo', '22']]
            },

dasmoth / dalliance

Feature Request: support for additional contig/chromosome aliases #64
