PacificBiosciences / ANGEL

Robust Open Reading Frame prediction (ANGLE re-implementation)

memory issue while running angel_train.py #20

Closed: ghost closed this issue 5 years ago

ghost commented 6 years ago

Dear @Magdoll,

I'm facing a memory issue while running angel_train.py on a server with 200G of memory. After about 30 minutes the memory usage reached its max (200G) and the calculation still hadn't finished. Does ANGEL really need this much memory for a single Iso-Seq dataset? I'm not used to Python and not sure whether I can limit the memory usage. Thanks in advance for any kind of help!

Best, Dewi

Magdoll commented 6 years ago

Hi @dewiang ,

How many input training sequences do you have? It's recommended to choose only the top 200 or 500 sequences, which should not take that long or use too much memory.
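
For reference, a minimal sketch of one way to keep only the top-N longest sequences from a FASTA, using Biopython (this is an illustration, not a script that ships with ANGEL; the file names are placeholders):

    # Sketch: keep only the N longest sequences as the ANGEL training input.
    # Assumes Biopython is installed; "isoforms.fasta" / "train.fasta" are placeholders.
    from Bio import SeqIO

    N = 200  # 200-500 training sequences are recommended above
    records = sorted(SeqIO.parse("isoforms.fasta", "fasta"),
                     key=lambda rec: len(rec.seq), reverse=True)
    SeqIO.write(records[:N], "train.fasta", "fasta")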

If this issue persists, let me know.

--Liz

defendant602 commented 5 years ago

Hi @Magdoll : I am facing the same problem Dewi described while running angel_train.py. I chose the top 500 longest sequences as the input training dataset, and the memory consumption while running angel_train.py goes up to 112G! My commands are below:

The input is named train.fasta:

dumb_predict.py train.fasta train --min_aa_length 300 --cpus 12
angel_make_training_set.py train.final train.final.training --random --cpus 12
angel_train.py train.final.training.cds train.final.training.utr train.final.classifier.pickle --cpus 4
angel_predict.py --cpus 20 isoform.fasta train.final.classifier.pickle Final.predict --min_dumb_aa_length 100
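
For what it's worth, the peak memory of a single step can be confirmed from Python itself; a rough sketch (assuming Linux, where ru_maxrss is reported in kilobytes, and mirroring the angel_train.py call above):

    # Sketch: run one ANGEL step and report the peak RSS of the child process.
    # Linux reports ru_maxrss in kilobytes (macOS uses bytes instead).
    import resource
    import subprocess

    subprocess.run(["angel_train.py",
                    "train.final.training.cds",
                    "train.final.training.utr",
                    "train.final.classifier.pickle",
                    "--cpus", "4"], check=True)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print("peak RSS: %.1f GB" % (peak_kb / 1024.0 / 1024.0))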

Could you give some advice on how to lower the memory consumption?

Best Regards.

Magdoll commented 5 years ago

Hi @defendant602 ,

In the past I've used 200 sequences to train and obtained the same results as with 500 training sequences. Could you please try 200 instead? If the memory issue still persists, let me know.

Thanks, --Liz

defendant602 commented 5 years ago

Hi @Magdoll ,

I cut the input training set down to the top 200 longest sequences and, indeed, obtained essentially the same results as with the top 500 (194,463 predicted CDS sequences with the top 500 vs. 193,566 with the top 200, only slightly fewer). The max memory consumption went down to 53G, almost half of the top-500 run (the commands were otherwise the same).

It looks like the max memory consumption of angel_train.py grows with the size of the input training dataset. Based on this test, I think the top 200 longest sequences are enough for training.

Best Regards.