Closed audreystott closed 5 years ago
EDIT: this doesn't QUITE work; it causes another error because one gene_id goes missing. I'll try to figure out how to remedy that and put it in the next comment.
Hi Audrey,
I am responsible for the current implementation of the annotation parsing. We mainly rely on upstream functions from the rtracklayer package, which is where the readGFF function comes from.
The issue here is that the particular annotation you're using has some absurdly long lines: the line in question has more than 150,000 characters, as do a few others in that GFF file. rtracklayer cannot handle lines this long due to size constraints. You can try removing all the lines that are too long using
gunzip -c GCF_000001735.4_TAIR10.1_genomic.gff.gz | awk '{ if (length($0) < 32768) print }' | gzip > GCF_000001735.4_TAIR10.1_genomic_trimmed.gff.gz
Where
gunzip -c GCF_000001735.4_TAIR10.1_genomic.gff.gz prints the contents of the gzip file,
awk '{ if (length($0) < 32768) print }' prints only lines shorter than 32768 characters (chosen arbitrarily, and seems to work), and
gzip > GCF_000001735.4_TAIR10.1_genomic_trimmed.gff.gz compresses the output of the previous commands and writes it to a new gzipped file.
You can compare the number of lines with gunzip -c GCF_000001735.4_TAIR10.1_genomic.gff.gz | wc -l and gunzip -c GCF_000001735.4_TAIR10.1_genomic_trimmed.gff.gz | wc -l. They should differ by fewer than 100 lines, so you aren't losing too much. If you want the entries you filtered out, you can always flip the comparison and run
gunzip -c GCF_000001735.4_TAIR10.1_genomic.gff.gz | awk '{ if (length($0) > 32767) print }' | gzip > excluded.gff.gz
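If it helps, the two filters above can be combined into one pass, with a small check of which lines are over the limit. This is only a rough sketch: the function names list_long_lines and split_by_length are made up for illustration, and the default limit of 32768 is the same arbitrary cutoff as above, not a documented rtracklayer value.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any package).
# The default limit of 32768 mirrors the arbitrary cutoff used above.

# Print "line_number<TAB>length" for every line at or above the limit.
# $1: gzipped GFF file, $2: length limit (optional)
list_long_lines() {
    gunzip -c "$1" |
        awk -v limit="${2:-32768}" 'length($0) >= limit { print NR "\t" length($0) }'
}

# Split a gzipped GFF in a single pass.
# $1: input .gff.gz, $2: output for kept lines, $3: output for excluded lines,
# $4: length limit (optional)
split_by_length() {
    tmp="$3.tmp"
    : > "$tmp"    # start from an empty side file, even if nothing is excluded
    gunzip -c "$1" |
        awk -v limit="${4:-32768}" -v excluded="$tmp" '
            length($0) < limit { print; next }   # short enough: pass through
            { print > excluded }                 # too long: divert to the side file
            END { close(excluded) }              # flush the side file before exiting
        ' | gzip > "$2"
    gzip -c "$tmp" > "$3" && rm -f "$tmp"
}
```

For example, list_long_lines GCF_000001735.4_TAIR10.1_genomic.gff.gz shows the offending line numbers and their lengths, and split_by_length GCF_000001735.4_TAIR10.1_genomic.gff.gz GCF_000001735.4_TAIR10.1_genomic_trimmed.gff.gz excluded.gff.gz writes both output files at once.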
Best of luck. Shian
Hi Shian
Thank you for your prompt reply. I was considering removing those lines but thought I would check with you guys first. I shall process my data without them for now as per your suggestion, and also check with my supervisor on whether it is a concern to have those annotations matched. Thank you once again.
Kind regards Audrey
Hi Luyi
I came across this issue while running the sc_count_aligned_bam function. The error message I got is this:
Error in readGFF(filepath, version = version, filter = filter) : reading GFF file: cannot read line 631680, line is too long
My gff3 file is from NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=txid3702[orgn]). Any assistance is much appreciated. Thank you.
Audrey