LuyiTian / scPipe

a pipeline for single cell RNA-seq data analysis

Error in readGFF #116

Closed audreystott closed 5 years ago

audreystott commented 5 years ago

Hi Luyi

I came across this issue while running the sc_count_aligned_bam function. The error message I got is this:

Error in readGFF(filepath, version = version, filter = filter) : reading GFF file: cannot read line 631680, line is too long

My gff3 file is from NCBI - (https://www.ncbi.nlm.nih.gov/genome/?term=txid3702[orgn]). Any assistance is much appreciated. Thank you.

Audrey

Shians commented 5 years ago

EDIT: this doesn't QUITE work, it causes another error because one gene_id goes missing. I'll try to figure out how to remedy that and put it in the next comment.

Hi Audrey,

I am responsible for the current implementation of the annotation parsing. We mainly rely on upstream functions from the rtracklayer package, which is where the readGFF function comes from.

The issue here is that the annotation you're using has some absurdly long lines: the line in question has more than 150,000 characters, as do a few others in that GFF file. rtracklayer cannot handle lines this long due to size constraints. You can try removing all the lines that are too long using

gunzip -c GCF_000001735.4_TAIR10.1_genomic.gff.gz | awk '{ if (length($0) < 32768) print }' | gzip > GCF_000001735.4_TAIR10.1_genomic_trimmed.gff.gz

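Here GCF_000001735.4_TAIR10.1_genomic.gff.gz is the original annotation file and GCF_000001735.4_TAIR10.1_genomic_trimmed.gff.gz is the filtered output.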

You can compare the number of lines by running gunzip -c GCF_000001735.4_TAIR10.1_genomic.gff.gz | wc -l and gunzip -c GCF_000001735.4_TAIR10.1_genomic_trimmed.gff.gz | wc -l. They should differ by fewer than 100 lines, so you aren't losing too much. If you want the entries you filtered out, you can always flip the comparison and run

gunzip -c GCF_000001735.4_TAIR10.1_genomic.gff.gz | awk '{ if (length($0) > 32767) print }' | gzip > excluded.gff.gz
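To see which entries the length filter would drop before committing to it, something like the following could help. It is a minimal sketch, not part of scPipe or rtracklayer: the function name report_long_lines and its optional limit argument are assumptions made for illustration, and 32768 simply mirrors the threshold in the commands above. For each over-long line it prints the line number, the line length, and the feature type (column 3 of the GFF), separated by tabs.

```shell
# Hypothetical helper: list the lines of a gzipped GFF file that are at or
# above a length limit. The default of 32768 is an assumption taken from the
# filter threshold used in this thread, not a documented rtracklayer value.
report_long_lines() {
    gff_gz="$1"            # path to the gzipped GFF file
    limit="${2:-32768}"    # minimum length that counts as "too long"
    # Output columns: line number, line length, feature type (GFF column 3).
    gunzip -c "$gff_gz" |
        awk -F'\t' -v limit="$limit" \
            'length($0) >= limit { print NR "\t" length($0) "\t" $3 }'
}

# Example usage (file name taken from the commands above):
#   report_long_lines GCF_000001735.4_TAIR10.1_genomic.gff.gz
```

If the feature type column shows gene or mRNA records among the over-long lines, dropping them may leave child features without a parent, so it is worth checking before running the pipeline on the trimmed file.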

Best of luck. Shian

audreystott commented 5 years ago

Hi Shian

Thank you for your prompt reply. I was considering removing those lines but thought I would check with you guys first. I shall process my data without them for now as per your suggestion, and also check with my supervisor on whether it is a concern to have those annotations matched. Thank you once again.

Kind regards Audrey