OceanGenomics / mudskipper

A tool for projecting genomic alignments to transcriptomic coordinates
BSD 3-Clause "New" or "Revised" License

Support for overhang alignment #32

yfei-w closed 1 year ago

yfei-w commented 2 years ago

This PR introduces a flag, -v or --max-overhang, that lets the user set the maximum number of overhanging bases allowed. Alignments that overhang at the start or at the end by no more than this limit are now included. The CIGAR string and pos are adjusted accordingly: the overhanging bases are changed to S (soft clip), which consumes query bases but not reference bases.
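To make the adjustment concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes an ungapped alignment (a single M run) already projected to 0-based transcript coordinates, assumes the limit applies to each end separately and is inclusive, and assumes "overhang" means bases that fall outside the transcript boundaries. The function name and its parameters are hypothetical.

```rust
/// Hypothetical helper: soft-clip bases that hang over either end of a
/// transcript of length `tlen`. `pos` is the 0-based start of an ungapped
/// alignment of `match_len` bases and may be negative when the alignment
/// starts before the transcript. Returns the adjusted position and CIGAR,
/// or `None` if the alignment should be dropped.
fn soft_clip_overhang(
    pos: i64,
    match_len: u32,
    tlen: i64,
    max_overhang: u32,
) -> Option<(i64, String)> {
    let end = pos + match_len as i64; // exclusive end on the transcript

    // Alignments lying entirely outside the transcript cannot be rescued.
    if end <= 0 || pos >= tlen {
        return None;
    }

    // Number of bases hanging over the start and the end of the transcript.
    let left = (-pos).max(0) as u32;
    let right = (end - tlen).max(0) as u32;

    // Assumption: the limit is checked per end and is inclusive.
    if left > max_overhang || right > max_overhang {
        return None;
    }

    let mut cigar = String::new();
    if left > 0 {
        cigar.push_str(&format!("{}S", left));
    }
    cigar.push_str(&format!("{}M", match_len - left - right));
    if right > 0 {
        cigar.push_str(&format!("{}S", right));
    }

    // Soft-clipped bases do not consume the reference, so an alignment that
    // started before the transcript now starts at position 0.
    Some((pos.max(0), cigar))
}
```

With a limit of 5, a 50-base alignment starting 3 bases before the transcript becomes `3S47M` at position 0, while one that runs 30 bases past the transcript end is still discarded.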

Automated tests for this feature are included. They need an index folder of about 50M, which is also part of this PR (it was created from the GENCODE v35 annotation file). The tests take about 12s to run, which is acceptable, and this could be reduced by building the index from only the necessary annotations (.gtf). The full index folder might still be useful in the future, though.

rob-p commented 2 years ago

Hi @yfei-w,

In general, we should not include data in the repo but should store it somewhere else and pull it in on demand. @gmarcais — any thoughts about where we can put the ~50M of data needed for the tests here, and how to pull it in easily?

gmarcais commented 2 years ago

Regarding the test data, one possible way is to put it in a different repo (say mudskipper_test_data) and make it a submodule. If the submodule is not initialized (with git submodule init and git submodule update after cloning), the test data is not available and the test should be skipped gracefully. If the data is available, the test should run.
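One way the "skip gracefully" part could look on the Rust side is sketched below. It relies on the fact that an uninitialized submodule is left as an empty directory in a fresh clone. The path `tests/data` and the function and test names are hypothetical, not taken from this repository.

```rust
use std::path::Path;

/// Hypothetical check: the test data counts as available only if the
/// directory exists and contains at least one entry. An uninitialized
/// submodule shows up as an empty directory, so it fails this check.
fn test_data_available(dir: &Path) -> bool {
    dir.read_dir()
        .map(|mut entries| entries.next().is_some())
        .unwrap_or(false)
}

#[test]
fn overhang_alignment() {
    // Hypothetical location of the submodule checkout.
    let data = Path::new("tests/data");
    if !test_data_available(data) {
        eprintln!("skipping: test data submodule is not initialized");
        return;
    }
    // The overhang tests against the index would run here.
}
```

Rust's built-in test harness has no runtime "skipped" status, so an early return like this is reported as a pass. The message on stderr is the only trace that the test did not actually run.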