Closed ghost closed 4 years ago
Can you run the following code and share the memory usage to diagnose the possible cause:
import pyranges
df_gtf = pyranges.read_gtf(gtf_file).df
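To put a number on the memory usage, a small stdlib helper can capture the peak while the call runs. This is a minimal sketch using tracemalloc; note that tracemalloc only sees allocations routed through Python's allocator, so memory held by C extensions may be undercounted (on Unix, resource.getrusage(...).ru_maxrss is a coarser alternative). The pyranges usage at the end is shown commented out as an assumption.

```python
import tracemalloc

def peak_memory(fn):
    """Run fn() and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Hypothetical usage with the snippet above:
# import pyranges
# df_gtf, peak = peak_memory(lambda: pyranges.read_gtf(gtf_file).df)
# print(f"peak: {peak / 1e9:.2f} GB")
```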
I think this is related to #18; as suggested there, we should avoid using pyranges.read_gtf(gtf_file).df.
@s6juncheng It is not best practice, but it should not consume hundreds of GB of memory.
Hi @segoerge, could you please share the GTF file (gzipped) you used? You can post it with a comment. The version of the pyranges package would also be helpful. Many thanks.
Hi @s6juncheng, sure. pyranges 0.0.84
@MuhammedHasan:
Can you run the following code and share the memory usage to diagnose the possible cause:
import pyranges
df_gtf = pyranges.read_gtf(gtf_file).df
I'm getting the same error after about 2 hours. Memory usage goes up constantly until it reaches the 400 GB mark; after that the process aborts. The memory is released again when I exit() from Python.
I confirm that the error happens with this particular GTF file. With a regular Ensembl GTF file, pyranges.read_gtf(gtf_file).df finishes within a few minutes.
The memory error does not seem to come simply from pyranges.read_gtf(gtf_file).df being an expensive operation, since gr.apply(fun) triggers the same error.
It looks like the error is related to pyranges with this particular GTF file. @endrebak, could you please comment on the memory error with pyranges.read_gtf(gtf_file).df on the NCBI GTF file posted above? If this is indeed an error in pyranges, I can open an issue in the pyranges repo to help keep track.
I would be happy to! Do you know if other libraries are able to read that file fine?
Endre
Thank you very much Endre. I haven't tried other libraries. I looked more closely at the problem: the NCBI GTF file has many more attribute fields than other GTF files (170+ fields compared with a typical <30). I guess that's why it crashed with a memory error. Reading the whole file also takes a long time (over an hour) with pyranges. It would probably be nice if pyranges could read only selected attributes.
I have a temporary solution by filtering necessary attributes from the gtf file before reading:
awk 'BEGIN{FS="\t";OFS="\t"}; {print $1,$2,$3,$4,$5,$6,$7,$8}' GRCh37_latest_genomic_filtered_withexonid.gtf > left.gtf
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";"); print a[1]";"a[2]";"a[3]"; ""exon_id \"NA\""}' GRCh37_latest_genomic_filtered_withexonid.gtf | paste -d"\t" left.gtf - > GRCh37_latest_genomic_reduced_withexonid.gtf
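The same prefilter can also be sketched in a single pass of pure Python, mirroring the awk pipeline above: keep the eight fixed GTF columns, the first three attributes, and append the placeholder exon_id. The file names are taken from the commands above; the commented-out loop shows hypothetical usage.

```python
def reduce_attributes(line):
    """Keep the 8 fixed GTF columns, the first 3 attributes, and a
    placeholder exon_id (mirrors the two-pass awk pipeline above)."""
    fields = line.rstrip("\n").split("\t")
    attrs = fields[8].split(";")[:3]
    fields[8] = ";".join(attrs) + '; exon_id "NA"'
    return "\t".join(fields)

# Hypothetical usage:
# with open("GRCh37_latest_genomic_filtered_withexonid.gtf") as src, \
#      open("GRCh37_latest_genomic_reduced_withexonid.gtf", "w") as dst:
#     for line in src:
#         if not line.startswith("#"):
#             dst.write(reduce_attributes(line) + "\n")
```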
After reducing the attributes, pyranges was able to process the file in seconds. @segoerge maybe you can proceed with your work with this temporary solution before we find a better one?
I tried with one awk pass, but it left the Feature column empty (the culprit is the typo S3 instead of $3 in the print statement; an unset awk variable expands to the empty string):
# This left the Feature column empty, not working
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";"); print $1,$2,S3,$4,$5,$6,$7,$8,a[1]";"a[2]";"a[3]"; ""exon_id \"NA\""}' GRCh37_latest_genomic_filtered_withexonid.gtf > GRCh37_latest_genomic_reduced_withexonid.gtf
Jun
I have come to the conclusion that it does not conform to what GTF files usually look like.
Whether it breaks the standard I do not know:
https://www.ensembl.org/info/website/upload/gff.html
It has attribute fields like
transcript_id "NM_001185157.1"; gene_id "IL24"; gene_name "IL24"; db_xref "GeneID:11009"; gbkey "CDS"; gene "IL24"; note "isoform 4 precursor is encoded by transcript variant 4"; product "interleukin-24 isoform 4 precursor"; protein_id "NP_001172086.1"; exon_number "6";
Attribute value fields seldom contain spaces.
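A parser that does not assume space-free values only needs to anchor on the quotes rather than splitting on whitespace. This is a minimal sketch, not pyranges' actual implementation; the attribute string is taken from the NCBI example above.

```python
import re

# Parse GTF attributes of the shape `key "value";` without assuming
# the quoted value is free of spaces: anchor on the quotes instead of
# splitting on whitespace.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')

def parse_attributes(attr_string):
    return dict(ATTR_RE.findall(attr_string))

attrs = parse_attributes(
    'gene_id "IL24"; note "isoform 4 precursor is encoded by '
    'transcript variant 4"; exon_number "6";'
)
# attrs["note"] retains its internal spaces
```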
@endrebak: For me it's the first time using a GTF from that source. We only have experience with the GFF-Version of this dataset. We are using this for our variant annotation with Ensembl-VEP. There was also some filtering needed. I think we kicked out miRNA entries that VEP couldn't handle.
@s6juncheng: Thanks! Yes, prefiltering is fine for me, because I have to filter for "protein_coding" entries anyway.
I'm trying to figure out whether the problem came with the original NCBI annotation file or was only introduced with the gffread step.
I've found the problem. I blithely assumed that strings in attribute value fields cannot contain whitespace. They almost never do, but nothing in the spec disallows it, AFAICS. I will fix it.
Btw, GTF-parsing is super slow (since it is an untidy format).
If you were to do something like
import pyranges as pr
gtf_df = pr.read_gtf(f).df
gtf_df.to_parquet("outfile.pq")
and then use outfile.pq in subsequent invocations of the program, you would save lots of time.
Thank you very much for the prompt fix and for suggesting a quick alternative, I will definitely try it.
Thank you very much guys for helping and fixing the issue! I can confirm, it's working now!
Description
Hi, somehow MMSplice is reserving a huge amount of memory (~400 GB) and crashing while doing this. I think it crashes because of a security setting on Linux that does not permit a single application to use this much memory. But I wonder whether the tool really needs this much? We have a powerful server, but I want to avoid this.
Best regards, Sebastian
What I Did
I'm using an in-house WGS case and a GTF file taken from NCBI: https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
I also had to modify the GTF file a bit, because you only mentioned official support for Ensembl and Gencode:
With this setting, I did a test on chromosome 1 only and everything was OK. But when I try to use the complete variant and GTF set, I get the following abort: