Using example code with local sequence file

shihabdider commented 5 years ago

I'm trying to do some benchmarking of CRAM-JS and am having some trouble with the example code.

I have three cram files and their corresponding crai indices as well as the reference sequences in separate fasta files. I'd like to use the local reference sequences to read the cram files, but I'm not sure how to change the seqFetch argument to do this. Currently the code just seems to generate a fake sequence of A's and the documentation is not entirely clear on how to use the local sequence file.

Also there might be some bugs with fetch records portion of the code (e.g the await is not inside an async)

const { IndexedCramFile, CramFile, CraiIndex } = require('@gmod/cram')

// open local files
const indexedFile = new IndexedCramFile({
  cramPath: require.resolve(`./data/ce#5.tmp.cram`),
  index: new CraiIndex({
    path: require.resolve(`./data/ce#5.tmp.cram.crai`),
  }),
  seqFetch: async (seqId, start, end) => {
    // seqFetch should return a promise for a string.
    // this one just returns a fake sequence of all A's of the proper length
    let fakeSeq = ''
    for (let i = start; i <= end; i += 1) {
      fakeSeq += 'A'
    }
    return fakeSeq
  },
  checkSequenceMD5: false,
})

// example of fetching records from an indexed CRAM file.
// NOTE: only numeric IDs for the reference sequence are accepted
const records = await indexedFile.getRecordsForRange(0, 10000, 20000)
records.forEach(record => {
  console.log(`got a record named ${record.readName}`)
  record.readFeatures.forEach(({ code, pos, refPos, ref, sub }) => {
    if (code === 'X')
      console.log(
        `${
          record.readName
        } shows a base substitution of ${ref}->${sub} at ${refPos}`,
      )
  })
})

// can also pass `cramUrl`, and `url` params to open remote URLs

cmdcolin commented 5 years ago

You might consider implementing seqFetch with https://github.com/GMOD/indexedfasta-js or https://github.com/gmod/twobit-js. There is not any example code but the basic idea should be fairly straightforward. I guess ideally those modules would not be the bottleneck in CRAM decoding but it could be something to be aware of to make sure the sequence retrieval is not the bottleneck (could add cache layer or sequence chunking)

Also for the async/await, it's true it needs wrapping in an async function technically. Ideally the JS community comes to some conclusion about "top level await" https://github.com/tc39/proposal-top-level-await

cmdcolin commented 5 years ago

Note also the use of seqid in the cram file instead of a sequence name e.g. chr1. The SAM header in the cram file can map that ID to the name if it exists (not all cram files have a SAM header but most do)

shihabdider commented 5 years ago

I was able to implement the seqFetch using indexedfasta-js, but I'm not having trouble with the seqid as you mentioned. The cram files I'm working with are the standard NA12878 human genomes (low coverage and exome respectively). Peeking inside, I can see that it uses the sequence names instead of numerical ids. Could explain what you meant by mapping the seqids to the names?

Here's the first few lines of the cram file.

@HD VN:1.5 GO:none SO:coordinate @SQ SN:chr1 LN:248956422 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:6aef897c3d6ff0c78aff06ac189178dd SP:Human @SQ SN:chr2 LN:242193529 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:f98db672eb0993dcfdabafe2a882905c SP:Human @SQ SN:chr3 LN:198295559 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:76635a41ea913a405ded820447d067b0 SP:Human @SQ SN:chr4 LN:190214555 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:3210fecf1eb92d5489da4346b3fddc6e SP:Human @SQ SN:chr5 LN:181538259 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:a811b3dc9fe66af729dc0dddf7fa4f13 SP:Human @SQ SN:chr6 LN:170805979 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:5691468a67c7e7a7b5f2a3a683792c29 SP:Human @SQ SN:chr7 LN:159345973 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:cc044cc2256a1141212660fb07b6171e SP:Human @SQ SN:chr8 LN:145138636 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:c67955b5f7815a9a1edfaa15893d3616 SP:Human @SQ SN:chr9 LN:138394717 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus _decoy_hla.fa AS:GRCh38 M5:6c198acf68b5af7b9d676dfdd531b5de SP:Human @SQ SN:chr10 LN:133797422 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_ set_plus_decoy_hla.fa AS:GRCh38 M5:c0eeee7acfdaf31b770a509bdaa6e51a SP:Human

cmdcolin commented 5 years ago

The seqid would just be whatever order the sequence name appears in that header

shihabdider commented 5 years ago

I see, so seqid of 0 would correspond to 'chr1' above, 1 to 'chr2' etc?

cmdcolin commented 5 years ago

Yep that's my understanding

shihabdider commented 5 years ago

Alright so if I use 0 as the seqid, I'm now getting an error where the ref and the sub properties are showing up as undefined. Likewise for the seqid property of record:

got a record named SRR622461.139 SRR622461.139 shows a base substitution of undefined->undefined at 9998 seqid: undefined SRR622461.139 shows a base substitution of undefined->undefined at 9999 seqid: undefined SRR622461.139 shows a base substitution of undefined->undefined at 10000 seqid: undefined SRR622461.139 shows a base substitution of undefined->undefined at 10031 seqid: undefined got a record named SRR622461.25281037 SRR622461.25281037 shows a base substitution of undefined->undefined at 9937 seqid: undefined SRR622461.25281037 shows a base substitution of undefined->undefined at 9938 seqid: undefined got a record named SRR622461.54611595 SRR622461.54611595 shows a base substitution of undefined->undefined at 10000 seqid: undefined SRR622461.54611595 shows a base substitution of undefined->undefined at 10029 seqid: undefined SRR622461.54611595 shows a base substitution of undefined->undefined at 10073 seqid: undefined SRR622461.54611595 shows a base substitution of undefined->undefined at 10074 seqid: undefined got a record named SRR622461.58377349

cmdcolin commented 5 years ago

It's hard to tell without code but are you passing indexed fasta the sequence name?

shihabdider commented 5 years ago

Oh shoot, I didn't realize that the indexed fasta uses sequence names instead of seqid. I implemented the seqFetch using indexed fasta directly:

const indexedFile = new IndexedCramFile({ cramPath: require.resolve(./NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram), index: new CraiIndex({ path: require.resolve(./NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram.crai), }), seqFetch: async (seqId, start, end) => { let seq = await t.getSequence(seqId, start, end); }, checkSequenceMD5: false, }) Is there anything in indexed fasta that will let me do the conversion, or should I just pull it from the order in the cram file?

cmdcolin commented 5 years ago

You could actually use indexedfasta's getResiduesById function, which takes a seqId, but that assumes the seqIds from the CRAM file match the seqIds from the FASTA file (which basically implies the sequences are in the same order in the SAM header and the FASTA file).

cmdcolin commented 5 years ago

Er sorry await t.getResiduesById(seqId, start, end);

cmdcolin commented 5 years ago

Also as long as you are testing on NA12878 might consider the pacbio/nanopore data for it

https://s3.amazonaws.com/1000genomes/technical/working/20131209_na12878_pacbio/si/NA12878.pacbio.bwa-mem.20131224.bam https://github.com/nanopore-wgs-consortium/NA12878

shihabdider commented 5 years ago

Unfortunately I'm getting the same error even after replacing with getResiduesById, I suspect this means the order is not preserved between the fasta file and the cram file. However, I don't quite understand why it's reporting the bases as undefined. Here's the full code:

const { IndexedCramFile, CramFile, CraiIndex } = require('@gmod/cram')
const { IndexedFasta, BgzipIndexedFasta } = require('@gmod/indexedfasta')

// open local files
const t = new IndexedFasta({
  path: 'GRCh38_full_analysis_set_plus_decoy_hla.fa',
  faiPath: 'GRCh38_full_analysis_set_plus_decoy_hla.fa.fai',
});

const indexedFile = new IndexedCramFile({
    cramPath: require.resolve(`./NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram`),
    index: new CraiIndex({
        path: require.resolve(`./NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram.crai`),
    }),
    seqFetch: async (seqId, start, end) => {
        let seq = await t.getResiduesById(seqId, start, end);
    },
    checkSequenceMD5: false,
})

// example of fetching records from an indexed CRAM file.
// NOTE: only numeric IDs for the reference sequence are accepted

run = async () => {
const records = await indexedFile.getRecordsForRange(0, 10000, 20000)

//check seq is string of sequences
let seq = await t.getResiduesById(0, 10000, 10100);
console.log(seq)

Array.from(records).forEach(record => {
    console.log(`got a record named ${record.readName}`)
    record.readFeatures.forEach(({ code, pos, refPos, ref, sub }) => {
    if (code === 'X'){
      console.log(
        `${record.readName} shows a base substitution of ${ref}->${sub} at ${refPos}`,
          'seqid: ', record.sequenceid
      )
  }})
})
}

run()
// can also pass `cramUrl`, and `url` params to open remote URLs

Also I did consider the nanopore set but decided to go with something a little smaller in size.

cmdcolin commented 5 years ago

You have

    seqFetch: async (seqId, start, end) => {
        let seq = await t.getResiduesById(seqId, start, end);
    },

I think you mean

    seqFetch: async (seqId, start, end) => {
        return await t.getResiduesById(seqId, start, end);
    },

🎉 :)

shihabdider commented 5 years ago

Whoops. Yes that seems to have fixed it. I'm just getting one more odd issue, the forEach is coming up as undefined over the record, even though it seems to be pulling the properties of the object correctly:

const { IndexedCramFile, CramFile, CraiIndex } = require('@gmod/cram')
const { IndexedFasta, BgzipIndexedFasta } = require('@gmod/indexedfasta')

// open local files
const t = new IndexedFasta({
  path: 'GRCh38_full_analysis_set_plus_decoy_hla.fa',
  faiPath: 'GRCh38_full_analysis_set_plus_decoy_hla.fa.fai',
});

const indexedFile = new IndexedCramFile({
    cramPath: require.resolve(`./NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram`),
    index: new CraiIndex({
        path: require.resolve(`./NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram.crai`),
    }),
    seqFetch: async (seqId, start, end) => {
        return seq = await t.getResiduesById(seqId, start, end);
    },
    checkSequenceMD5: false,
})

// example of fetching records from an indexed CRAM file.
// NOTE: only numeric IDs for the reference sequence are accepted

run = async () => {
const records = await indexedFile.getRecordsForRange(1, 10000, 20000)

Array.from(records).forEach(record => {
    console.log(`got a record named ${record.readName}`)
    record.readFeatures.forEach( ({ code, pos, refPos, data }) => {
      console.log(
        `${record.name} has ${code} and ${data} associated at ${refPos}`,
      )
  })
})
}

run()

Which outputs:

got a record named SRR622461.70654135
undefined has S and CCTAACCCTGA associated at 10238
undefined has D and 1 associated at 10250
undefined has X and 0 associated at 10251
undefined has X and 0 associated at 10270
undefined has X and 2 associated at 10271
undefined has X and 0 associated at 10281
undefined has S and CTACCCATCTCTCACACGGCTCATATCCCAAATCTAC associated at 10292
got a record named SRR622461.90336908
undefined has X and 0 associated at 10364
undefined has D and 1 associated at 10413
undefined has X and 1 associated at 10431
undefined has S and TC associated at 10438
...
(node:28380) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'forEach' of undefined
    at Array.from.forEach.record (/Users/shihabdider/Research/CRAM_JS/test_data/read_cram.js:30:25)
    at Array.forEach (<anonymous>)
    at run (/Users/shihabdider/Research/CRAM_JS/test_data/read_cram.js:28:21)
(node:28380) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:28380) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Thanks so much for your help!

cmdcolin commented 5 years ago

I think some of the reads simply don't have any read features so gotta check for that

shihabdider commented 5 years ago

Yep that fixed it. Thanks again!

GMOD / cram-js

Using example code with local sequence file #18