gagneurlab / MMSplice_MTSplice

Tissue-specific variant effect predictions on splicing
MIT License

MemoryError #43

Closed · ghost closed this issue 4 years ago

ghost commented 4 years ago

Description

Hi, somehow MMSplice reserves a huge amount of memory (~400 GB) and crashes while doing so. I suspect it crashes because a Linux security setting does not permit a single application to use this much memory. But I wonder whether the tool really needs this much? We have a strong server, but I want to avoid this.

Best regards, Sebastian

What I Did

I'm using an in-house WGS case and a GTF file taken from NCBI: https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/

I also had to modify the GTF file a bit, because you only mention official support for Ensembl and GENCODE.

With this setup, I ran a test on chromosome 1 only and everything was fine. But when I try to use the complete variant and GTF set, I get the following abort:

Traceback (most recent call last):
  File "mmsplicetest.py", line 13, in <module>
    dl = SplicingVCFDataloader(gtf, fasta, vcf, tissue_specific=False)
  File "/test/mmsplice/lib/python3.7/site-packages/mmsplice/vcf_dataloader.py", line 123, in __init__
    pr_exons = self._read_exons(gtf, overhang)
  File "/test/mmsplice/lib/python3.7/site-packages/mmsplice/vcf_dataloader.py", line 140, in _read_exons
    return read_exon_pyranges(gtf, overhang=overhang)
  File "/test/mmsplice/lib/python3.7/site-packages/mmsplice/vcf_dataloader.py", line 29, in read_exon_pyranges
    df_gtf = pyranges.read_gtf(gtf_file).df
  File "/test/mmsplice/lib/python3.7/site-packages/pyranges/readers.py", line 301, in read_gtf
    gr = read_gtf_full(f, as_df, nrows, _skiprows, duplicate_attr)
  File "/test/mmsplice/lib/python3.7/site-packages/pyranges/readers.py", line 337, in read_gtf_full
    df.loc[:, "Start"] = df.Start - 1
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/indexing.py", line 670, in __setitem__
    iloc._setitem_with_indexer(indexer, value)
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/indexing.py", line 1542, in _setitem_with_indexer
    take_split_path = self.obj._is_mixed_type
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/generic.py", line 5232, in _is_mixed_type
    return self._protect_consolidate(f)
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/generic.py", line 5194, in _protect_consolidate
    result = f()
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/generic.py", line 5231, in <lambda>
    f = lambda: self._mgr.is_mixed_type
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 691, in is_mixed_type
    self._consolidate_inplace()
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 980, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1900, in _consolidate
    list(group_blocks), dtype=dtype, can_consolidate=_can_consolidate
  File "/test/mmsplice/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1922, in _merge_blocks
    new_values = np.vstack([b.values for b in blocks])
  File "<__array_function__ internals>", line 6, in vstack
  File "/test/mmsplice/lib/python3.7/site-packages/numpy/core/shape_base.py", line 283, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate 401. GiB for an array with shape (43752, 1231210) and data type object
MuhammedHasan commented 4 years ago

Can you run the following code and share the memory usage, to help diagnose the possible cause:

import pyranges 
df_gtf = pyranges.read_gtf(gtf_file).df
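For example, one way to capture the peak usage alongside the read, as a minimal sketch assuming Linux (where resource.ru_maxrss is reported in KiB):

import resource
import pyranges

gtf_file = "GRCh37_latest_genomic_filtered_withexonid.gtf"  # adjust to your path
df_gtf = pyranges.read_gtf(gtf_file).df

# Peak resident set size of this process, converted from KiB to GiB.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak memory: {peak_kib / 1024**2:.1f} GiB")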
s6juncheng commented 4 years ago

I think this is related to #18; we should avoid using pyranges.read_gtf(gtf_file).df, as suggested there.

MuhammedHasan commented 4 years ago

@s6juncheng It is not the best practice, but it should not consume hundreds of GB of memory.

s6juncheng commented 4 years ago

Hi @segoerge, could you please share the gtf file (gzipped) you used? You can post it with a comment. Also, the version of the pyranges package would be helpful. Many thanks.

ghost commented 4 years ago

Hi @s6juncheng, sure. pyranges 0.0.84

GRCh37_latest_genomic_filtered_withexonid.gtf.gz

ghost commented 4 years ago

@MuhammedHasan:

Can you run the following code and share the memory usage, to help diagnose the possible cause:

import pyranges
df_gtf = pyranges.read_gtf(gtf_file).df

I'm getting the same error after about 2 hours. Memory usage climbs steadily until it reaches the 400 GB mark; after that the process aborts. The memory is released again when I exit() from Python.

s6juncheng commented 4 years ago

I confirm that the error happens with this particular gtf file. With a regular Ensembl gtf file, pyranges.read_gtf(gtf_file).df finishes within a few minutes.

The memory error does not seem to be caused by pyranges.read_gtf(gtf_file).df being an expensive operation per se, since using gr.apply(fun) triggers the same error.

It looks like the error is specific to pyranges with this particular gtf file. @endrebak could you please comment on the memory error with pyranges.read_gtf(gtf_file).df on the NCBI gtf file posted above? If this is indeed an error in pyranges, I can open an issue in the pyranges repo to help keep track of it.

endrebak commented 4 years ago

I would be happy to! Do you know if other libraries are able to read that file fine?

Endre

s6juncheng commented 4 years ago

Thank you very much Endre. I haven't tried other libraries. I looked more closely at the problem: the NCBI gtf file has many more attribute fields than other gtf files (170+ fields, compared with typically <30). I guess that's why it crashed with a memory error; indeed, the traceback shows pandas trying to consolidate 43,752 object columns over 1,231,210 rows, and at 8 bytes per object pointer that is almost exactly the reported 401 GiB. Reading the whole file also takes a long time (about an hour) with pyranges. It would probably be nice if pyranges could read only selected attributes?

I have a temporary workaround: filter the necessary attributes out of the gtf file before reading it:

# Pass 1: keep the eight fixed GTF columns.
awk 'BEGIN{FS="\t";OFS="\t"}; {print $1,$2,$3,$4,$5,$6,$7,$8}' GRCh37_latest_genomic_filtered_withexonid.gtf > left.gtf

# Pass 2: keep the first three attributes, append a placeholder exon_id,
# and paste the attribute column back onto the fixed columns.
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";"); print a[1]";"a[2]";"a[3]"; ""exon_id \"NA\""}' GRCh37_latest_genomic_filtered_withexonid.gtf | paste -d"\t" left.gtf - > GRCh37_latest_genomic_reduced_withexonid.gtf

After reducing the attributes, pyranges was able to process the file in seconds. @segoerge maybe you can proceed with your work using this temporary solution until we find a better one?

I first tried to do it in a single awk pass, but it left the Feature column empty. The cause turns out to be a typo: S3 instead of $3, which awk silently treats as an empty, uninitialized variable. The corrected one-pass version:

# One pass: fixed columns plus reduced attributes (note $3, not S3).
awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,";"); print $1,$2,$3,$4,$5,$6,$7,$8,a[1]";"a[2]";"a[3]"; ""exon_id \"NA\""}' GRCh37_latest_genomic_filtered_withexonid.gtf > GRCh37_latest_genomic_reduced_withexonid.gtf
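For anyone who prefers Python, a rough equivalent of the reduction above (a sketch, not part of MMSplice; the filenames match the awk commands):

# Keep the 8 fixed GTF columns, the first three attributes, and a
# placeholder exon_id, mirroring the awk reduction above.
with open("GRCh37_latest_genomic_filtered_withexonid.gtf") as src, \
        open("GRCh37_latest_genomic_reduced_withexonid.gtf", "w") as dst:
    for line in src:
        if line.startswith("#"):
            dst.write(line)  # pass header/comment lines through untouched
            continue
        fields = line.rstrip("\n").split("\t")
        attrs = ";".join(fields[8].split(";")[:3])
        fields[8] = attrs + '; exon_id "NA"'
        dst.write("\t".join(fields[:9]) + "\n")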

Jun

endrebak commented 4 years ago

I have come to the conclusion that it does not conform to what GTF files usually look like.

Whether it breaks the standard I do not know:

https://www.ensembl.org/info/website/upload/gff.html

It has attribute fields like:

transcript_id "NM_001185157.1"; gene_id "IL24"; gene_name "IL24"; db_xref "GeneID:11009"; gbkey "CDS"; gene "IL24"; note "isoform 4 precursor is encoded by transcript variant 4"; product "interleukin-24 isoform 4 precursor"; protein_id "NP_001172086.1"; exon_number "6";

Attribute value fields seldom contain spaces.
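A minimal sketch (not pyranges' actual parser) of why whitespace inside quoted values is hazardous for readers that tokenize the attribute column on spaces:

attr = ('transcript_id "NM_001185157.1"; gene_id "IL24"; '
        'note "isoform 4 precursor is encoded by transcript variant 4";')

# Quote-respecting parse: split on ";", then on the FIRST space only.
safe = {}
for field in filter(None, (f.strip() for f in attr.split(";"))):
    key, _, value = field.partition(" ")
    safe[key] = value.strip('"')
print(sorted(safe))   # ['gene_id', 'note', 'transcript_id']

# Whitespace tokenization: words inside the free-text "note" value are
# mistaken for keys, so rows like this invent spurious attribute columns.
tokens = attr.replace(";", " ").split()
naive = dict(zip(tokens[0::2], tokens[1::2]))
print(sorted(naive))  # bogus keys such as '4', 'by', 'is', 'variant'

Multiplied over the ~1.2 million rows of this annotation, bogus keys like these can blow the parsed dataframe up to tens of thousands of columns.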

ghost commented 4 years ago

@endrebak: This is the first time I'm using a GTF from that source. We only have experience with the GFF version of this dataset, which we use for variant annotation with Ensembl-VEP. Some filtering was needed there too; I think we kicked out miRNA entries that VEP couldn't handle.

@s6juncheng: Thanks! Yes, prefiltering is fine for me, since I have to filter for "protein_coding" entries anyway.

s6juncheng commented 4 years ago

I'm trying to figure out whether the problem comes from the original NCBI annotation file or was only introduced by the gffread step.

endrebak commented 4 years ago

I've found the problem. I blithely assumed that strings in attribute value fields cannot contain whitespace. They almost never do, but there is nothing in the spec that disallows it, AFAICS. I will fix it.
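Presumably the fix amounts to pairing keys with whole quoted values instead of splitting on every space, e.g. with a regex (a sketch, not the actual patch):

import re

# Match: a key, a space, then a quoted value that may itself contain spaces.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

attr = 'note "isoform 4 precursor is encoded by transcript variant 4"; exon_number "6";'
print(dict(ATTR_RE.findall(attr)))
# {'note': 'isoform 4 precursor is encoded by transcript variant 4', 'exon_number': '6'}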

endrebak commented 4 years ago

I'll push this fix to PyPI later today. Thanks for pinging me.

endrebak commented 4 years ago

Btw, GTF parsing is super slow (since it is an untidy format).

If you were to do something like

import pyranges as pr
gtf_df = pr.read_gtf(f).df       # f: path to your GTF file
gtf_df.to_parquet("outfile.pq")  # needs a parquet engine, e.g. pyarrow

and then use outfile.pq in subsequent invocations of the program, you would save a lot of time.
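In later runs you could then skip the GTF entirely, e.g. with a small caching wrapper (a sketch; the annotation path is a placeholder):

import os
import pandas as pd
import pyranges as pr

gtf_file, cache = "annotation.gtf", "outfile.pq"  # placeholder paths

if os.path.exists(cache):
    gtf_df = pd.read_parquet(cache)    # loads in seconds
else:
    gtf_df = pr.read_gtf(gtf_file).df  # slow one-time GTF parse
    gtf_df.to_parquet(cache)           # cache for the next run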

s6juncheng commented 4 years ago

Thank you very much for the prompt fix and for suggesting a quick alternative; I will definitely try it.

ghost commented 4 years ago

Thank you very much, guys, for helping and fixing the issue! I can confirm that it's working now!