dasmoth / dalliance

Interactive web-based genome browser.
http://www.biodalliance.org/
BSD 2-Clause "Simplified" License
Search doesn't work for BED files with Gene IDs in Non-alphabetical Order #201

Closed ericnelson0 closed 7 years ago

ericnelson0 commented 7 years ago

Some of the BED files we use in our lab prefix the gene ID with the string 'nc' (e.g. "ncRv0001") to mark non-coding regions. A typical gene ID looks like "Rv0001", without the 'nc'. For the prefixed IDs, the search fails to find and highlight the region on the track.

From what I can gather, this is due to the way localeCompare is used in the search algorithm. The lookup function expects the entry to be in a particular place because it assumes the gene IDs are in alphabetical order, and the 'nc' prefix breaks this assumption. One could try to move the 'nc' regions around in the BED file, but the file has to be sorted by coordinate before it can be converted to bigBed.
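To make the suspected mechanism concrete, here is a minimal sketch of how a binary search could miss a key when the comparator does not match the order the keys were sorted in. This is a hypothetical illustration, not Biodalliance's actual lookup code: the function names, the key list, and the assumption that the index keys are stored in plain byte (code-unit) order are all made up for the example.

```javascript
// Hypothetical illustration only; not taken from the Biodalliance source.

// Compare by UTF-16 code unit, which equals byte order for ASCII keys.
function compareCodeUnits(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Locale-aware comparison, pinned to 'en' so the result does not
// depend on the host's default locale.
function compareLocale(a, b) {
  return a.localeCompare(b, 'en');
}

// Plain binary search; returns the index of `key`, or -1 if not found.
// It is only correct when `sortedKeys` is sorted according to `compare`.
function binarySearch(sortedKeys, key, compare) {
  let lo = 0;
  let hi = sortedKeys.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const c = compare(key, sortedKeys[mid]);
    if (c === 0) return mid;
    if (c < 0) hi = mid - 1;
    else lo = mid + 1;
  }
  return -1;
}

// Assumption: keys are stored in code-unit order, where 'R' (0x52)
// sorts before 'n' (0x6E), so "ncRv0005" lands after every "Rv..." key.
const keys = ['Rv0001', 'Rv0002', 'Rv0003', 'Rv0004', 'Rv0006nc', 'ncRv0005']
  .sort(compareCodeUnits);

const prefixByCodeUnits = binarySearch(keys, 'ncRv0005', compareCodeUnits);
const prefixByLocale = binarySearch(keys, 'ncRv0005', compareLocale);
const suffixByLocale = binarySearch(keys, 'Rv0006nc', compareLocale);

console.log({ prefixByCodeUnits, prefixByLocale, suffixByLocale });
```

In a runtime with ICU collation available, the locale-aware search looks to the left of the "Rv" keys for "ncRv0005" (because "n" collates before "r") and does not find it, while the key with "nc" at the end is found under either ordering. Whether this is what actually happens in the browser depends on how the index is really sorted, which this sketch only assumes.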

Maybe the search should be based on coordinates within the genome instead of the gene IDs?

dasmoth commented 7 years ago

Are you relying just on the .bigBed indexings (with -extraIndex)? Or do you have a Trix index (built with ixIxx) as well?

I suspect what might be going on here is the JS string comparison not quite matching what the UCSC code is doing, but not 100% sure at the moment. Is there any chance I could have a copy of your .bigBed file (or a subset that demonstrates the problem)?

mad-lab commented 7 years ago

@dasmoth Hi, I'm in the same lab as @ericnelson0. I can answer some of these questions.

> Are you relying just on the .bigBed indexings (with -extraIndex)? Or do you have a Trix index (built with ixIxx) as well?

We are creating both the bigBed file and the Trix indexes as you describe here:

https://github.com/dasmoth/gtf2bed

We follow the same steps, except the .bed file we start from is created using a custom script because our original annotation isn't in GTF format.

> I suspect what might be going on here is the JS string comparison not quite matching what the UCSC code is doing, but not 100% sure at the moment. Is there any chance I could have a copy of your .bigBed file (or a subset that demonstrates the problem)?

Below is a link to a zip archive with two example BED files. Both have several genes with the typical ID for the H37Rv genome: "Rvxxxx". One has a single entry with "nc" at the front (e.g. "ncRvxxxx"), and the other has an entry with "nc" at the end. Search fails for the former, but it works for the latter.

test_bed_files.zip

dasmoth commented 7 years ago

Thanks very much for sending me example files. I've made bigBed files (using bedToBigBed) and Trix indices (using ixIxx), and currently searching for the ncRv0005 gene is working fine for me using git-latest Biodalliance, both with direct bigBed indexing (i.e. no Trix index) and with the Trix index enabled.

Would it be possible to send bigBeds and Trix indices as well? Also, an example of your Biodalliance source config, just in case there's something odd going on there.

Sorry it's taking a while to get to the bottom of this.

mad-lab commented 7 years ago

Well, I don't know why, but it looks like it's working now.

I was in the process of creating the example Trix and bigBed files for you to look at, and I tested it again and it worked. I don't think we did anything different, though. Could it have been some sort of cache issue?

Anyways, sorry if we wasted your time.

dasmoth commented 7 years ago

No problem, and glad to hear it's working for you now.

I'll close this issue for now, but feel free to re-open if you see the problem again.