PengNi / basemods_spark


Spark can not serialize object larger than 2G #6

Open PengNi opened 6 years ago

PengNi commented 6 years ago

Spark 2.1.0

A large cmp.h5 file may be created for a repeat region of a reference after aligning with blasr. The mean coverage of such a repeat region can be 10K or more, so the cmp.h5 file can reach 1G, and over 2G after unpacking.

Thus, the problem is that when Spark needs to serialize the object (here, the cmp.h5 data in memory), it throws this exception:

error: 'i' format requires -2147483648 <= number <= 2147483647

at lines 273 and 562 of pyspark/serializers.py.

It seems that Spark uses 'i' (for int) directly rather than 'q' (for long) when calling struct.pack, and so has the limitation that it cannot serialize an object larger than 2G.
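
A minimal sketch of the limitation (not pyspark's actual framing code): struct.pack with the 'i' format only accepts values in the signed 32-bit range, so any length at or above 2**31 bytes fails, while 'q' would not:

```python
import struct

struct.pack('i', 2**31 - 1)      # largest length 'i' can encode (~2 GiB)

try:
    struct.pack('i', 2**31)      # one byte more than the 'i' range allows
except struct.error as e:
    print(e)                     # same kind of error as the one quoted above

struct.pack('q', 2**31)          # 'q' (64-bit) handles it without trouble
```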

PengNi commented 6 years ago

We need to remove part of the reads aligned to the repeat region.

Reads should be filtered/removed based on certain rules, e.g. alignment quality.

Check SamFilter in BLASR for some inspiration.
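
A minimal sketch of such a filter, assuming each alignment record exposes a 'MapQV' field (the record layout and the threshold are illustrative, not taken from the repo):

```python
MAPQV_THRESHOLD = 20  # assumed cutoff for illustration

def filter_by_mapqv(alignments, threshold=MAPQV_THRESHOLD):
    """Keep only alignments whose mapping quality (MapQV) meets the cutoff."""
    return [a for a in alignments if a['MapQV'] >= threshold]
```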

PengNi commented 6 years ago

Now we are using 'MapQV' to filter reads (commit edb4d0e).

PengNi commented 6 years ago

This error still occurs when a single chunk of the reference has a large number of aligned reads.

For now I haven't figured out a better way, so I am going to abandon 'MapQV', use 'random' only, and rewrite this part of the code. See commit 559c60
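
A minimal sketch of the random-only approach, assuming the alignments for one reference chunk are already collected in a list (the cap value is illustrative, not from the repo):

```python
import random

MAX_READS_PER_CHUNK = 5000  # assumed cap, chosen so the chunk stays well under 2G

def random_downsample(alignments, max_reads=MAX_READS_PER_CHUNK):
    """Randomly keep at most max_reads alignments for one reference chunk."""
    if len(alignments) <= max_reads:
        return alignments
    return random.sample(alignments, max_reads)
```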

1830191044 commented 3 years ago

Have you solved this problem?

PengNi commented 3 years ago

@1830191044, I didn't figure out how to completely avoid the issue; however, I used random partitioning to lower the chance of hitting it.
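
One way to read "random partitioning" here is to attach a random key to each record so that reads from a heavily covered chunk are spread over many Spark partitions instead of being serialized as a single huge object. A minimal sketch under that assumption (the record layout and partition count are illustrative, not from the repo):

```python
import random
from pyspark import SparkContext

sc = SparkContext(appName="random-partition-sketch")
NUM_PARTITIONS = 8  # in practice, enough partitions that each stays well under 2G

# Toy stand-in for alignment records that all hit one repeat region.
records = sc.parallelize([("chr1:0", "read_%d" % i) for i in range(100000)])

# Key each record with a random partition index, then repartition by that key.
spread = (records
          .map(lambda kv: (random.randint(0, NUM_PARTITIONS - 1), kv))
          .partitionBy(NUM_PARTITIONS)
          .values())

print(spread.getNumPartitions())
sc.stop()
```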