cran2367 / sgt

Sequence Graph Transform

Big dataframe of sequences #5

Closed by trikiamine23 4 years ago

trikiamine23 commented 4 years ago

I am having an issue with the size of my dataframe. I have 800,000 different sequences. The multiprocessing starts fine, but then it stops and stays unresponsive. Is this related to SGT or to pandarallel?

cran2367 commented 4 years ago

What is the alphabet size, and what is the average length of the sequences?

trikiamine23 commented 4 years ago

Alphabet size: 255
Average sequence length: 12

cran2367 commented 4 years ago

It is likely that your system is reaching its computational limit, which is causing the process to hang. The current SGT 2.x implements an algorithm that is efficient when the sequence length is less than the alphabet size. The next version will add another algorithm that is efficient for your case. It will be released in a few months. For now, I suggest breaking the 800k dataset into chunks and applying SGT to each chunk.
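The chunking suggestion above could look roughly like the sketch below. It is only an illustration, not code from the SGT project: `iter_chunks`, `transform_in_chunks`, and `count_symbols` are hypothetical names, and `count_symbols` is a stand-in for whatever per-chunk transform you actually run (for example, fitting SGT on one slice of the data).

```python
def iter_chunks(rows, chunk_size):
    """Yield consecutive slices of `rows`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def transform_in_chunks(rows, transform, chunk_size):
    """Apply `transform` to one chunk at a time and collect the results in order."""
    results = []
    for chunk in iter_chunks(rows, chunk_size):
        # Only one chunk is handed to the transform at a time, which bounds
        # the amount of work (and memory) needed per call.
        results.extend(transform(chunk))
    return results


def count_symbols(chunk):
    # Hypothetical stand-in transform: returns (id, sequence length) per row.
    # Replace this with the real per-chunk computation.
    return [(seq_id, len(seq)) for seq_id, seq in chunk]


# Toy corpus: 10 sequences of 3 symbols each, as (id, sequence) pairs.
corpus = [(i, ["A", "B", "C"]) for i in range(10)]
result = transform_in_chunks(corpus, count_symbols, chunk_size=4)
```

With a pandas DataFrame, the same loop would slice rows with `corpus.iloc[start:start + chunk_size]` and combine the per-chunk outputs with `pd.concat`. If the transform infers the alphabet from the data it sees, different chunks may produce different feature columns, so it is worth checking whether the alphabet can be fixed up front before combining results.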

trikiamine23 commented 4 years ago

Thank you very much, I will wait for the next release! Your work is very interesting.

cran2367 commented 4 years ago

@trikiamine23 thank you for your note!