cole-trapnell-lab / cicero-release

https://cole-trapnell-lab.github.io/cicero-release/
MIT License

annotate_cds_by_site error with 10x genomics reference genome #56

Closed jindalk closed 4 years ago

jindalk commented 4 years ago

Hi, I am trying to use Cicero to calculate gene activity scores. When I run the following command:

input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)

I get this error (with traceback):

NAs introduced by coercion
NAs introduced by coercion
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
  In range 59224: at least two out of 'start', 'end', and 'width', must be supplied.

After digging around a bit, I noticed that the error occurs because the current implementation of ranges_for_coords cannot handle the chromosome names in the 10x Genomics scATAC refdata.

The line

coord_cols <- stringr::str_split_fixed(coord_strings, ":|-|_", 3)

splits each string on ":", "-", or "_" into at most 3 parts.

This fails for coordinates like chr4_GL456216_random_13926_14126, because it returns chr4, GL456216, random_13926_14126 instead of the chromosome name, start, and end.
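To make the failure concrete, here is a minimal reproduction of the split on one well-formed peak and one peak from a random contig. It needs the stringr package. The as.numeric() conversion of the second and third columns is an assumption about what ranges_for_coords does next, not code copied from cicero, and the variable names are made up for illustration.

```r
library(stringr)

peaks <- c("chr1_100_200", "chr4_GL456216_random_13926_14126")

# Same call as quoted above: split on ":", "-" or "_" into at most 3 pieces
coord_cols <- str_split_fixed(peaks, ":|-|_", 3)

# Row 1: "chr1" "100"      "200"
# Row 2: "chr4" "GL456216" "random_13926_14126"
print(coord_cols)

# Assumption: start and end are then coerced to numbers. For the second peak
# both become NA, and R warns "NAs introduced by coercion" once per column.
peak_start <- as.numeric(coord_cols[, 2])
peak_end <- as.numeric(coord_cols[, 3])
```

With both start and end missing for that peak, IRanges would have nothing to build a range from, which is consistent with the "at least two out of 'start', 'end', and 'width'" message.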

Perhaps the constraint of returning 3 values could be removed from the str_split_fixed call, and the last 3 values of whatever list is returned could be taken instead?
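One way to sketch the "parse from the end" idea in base R is shown below. It treats the last two numeric fields as start and end, and everything before them as the chromosome name, so underscores inside the name survive. The helper name split_peak_from_right is hypothetical, and this is not the fix that was eventually made in cicero.

```r
# Hypothetical helper, not part of cicero: read start and end from the right
split_peak_from_right <- function(coord_strings) {
  # chromosome (anything), separator, digits, separator, digits
  pattern <- "^(.+)[:_-]([0-9]+)[:_-]([0-9]+)$"
  matched <- regmatches(coord_strings, regexec(pattern, coord_strings))

  # Non-matching strings yield character(0); fill those rows with NA
  parts <- t(vapply(matched, function(m) {
    if (length(m) == 4) m[2:4] else rep(NA_character_, 3)
  }, character(3)))

  data.frame(
    chr = parts[, 1],
    start = as.numeric(parts[, 2]),
    end = as.numeric(parts[, 3]),
    stringsAsFactors = FALSE
  )
}

res <- split_peak_from_right(c("chr4_GL456216_random_13926_14126",
                               "chr1:100-200"))
print(res)
```

The greedy (.+) group is what keeps "chr4_GL456216_random" together, since the pattern only needs the final two separator-plus-digits groups to match at the end of the string.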

For now I'm trying to use a subset of the input to exclude such cases:

input_cds <- subset(input_cds, !grepl("_", fData(input_cds)$chr))
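A similar filter can also be written against the peak names themselves, by keeping only names that contain exactly two separators. The sketch below uses a plain data frame as a stand-in for fData(input_cds). The peak_info object and its site_name column are assumptions for illustration and may not match the columns in a real CDS.

```r
# Hypothetical stand-in for fData(input_cds); the column name is an assumption
peak_info <- data.frame(
  site_name = c("chr1_100_200",
                "chr4_GL456216_random_13926_14126",
                "chrX_500_900"),
  stringsAsFactors = FALSE
)

# Count the ":", "-" and "_" separators in each peak name
sep_hits <- gregexpr("[:_-]", peak_info$site_name)
n_sep <- lengths(regmatches(peak_info$site_name, sep_hits))

# A name of the form chr_start_end has exactly two separators
keep <- n_sep == 2
filtered <- peak_info[keep, , drop = FALSE]
print(filtered$site_name)
```

On a real object, the same logical vector could presumably be used to index the rows of the CDS, though that depends on how the peak names are stored.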

Thanks

hpliner commented 4 years ago

Hi @jindalk

Thanks for the report. You're right that peaks on the random contigs won't work in Cicero at the moment. It's been on the list to fix for a while, but I haven't gotten to it yet. Thanks for the reminder! Hopefully I'll get to it soon.

hpliner commented 4 years ago

Hi @jindalk - Finally got around to this. If you install the latest version, it should be fixed!

jindalk commented 4 years ago

Thanks very much!