collab-uniba / Senti4SD

An emotion-polarity classifier specifically trained on developers' communication channels
http://collab.di.uniba.it/research
MIT License

Futures timed out after [24 hours] #6

Closed. iscas-lee closed this issue 5 years ago

iscas-lee commented 5 years ago

I have a large file with almost 200k lines. When I run Senti4SD on it, the run takes more than 24 hours and then fails with the error message "Futures timed out after [24 hours]". Could you please help me solve this problem?

bateman commented 5 years ago

Does it work with a smaller subset?


iscas-lee commented 5 years ago

Yes, the software works fine on small files. But if the running time exceeds 24 hours, "Futures timed out after [24 hours]" is displayed.

bateman commented 5 years ago

Sorry for the late reply. We are aware of the issue with large files. This is, however, a limitation of R itself: by default, R tries to load an entire file into memory, so we need to re-code our script to work around that. We do not have time to fix this right now (we're busy teaching), nor do we have a student working on it at the moment. If you are in a hurry, I suggest you read [1] and [2], which give you an idea of how to resolve the problem. The easiest option is the ff library if your data frame contains heterogeneous data; if the data are homogeneous (e.g., a numeric matrix), the bigmemory library will also do. The most general solutions are to use Hadoop and map-reduce to split the task into smaller, faster subtasks that run in parallel [2] or, alternatively, to store the data in a database and query it from there [3].
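As a stopgap that needs no changes to the R script, the "smaller, faster subtasks" idea can be approximated by splitting the input before classification, since the tool reportedly works on small files. The sketch below is only an illustration, not part of Senti4SD: the function name `split_into_chunks`, the chunk file naming, and the default chunk size are all hypothetical, and it assumes the input holds one document per line with no header row and no quoted fields spanning several lines.

```python
import itertools
from pathlib import Path


def split_into_chunks(input_path, out_dir, lines_per_chunk=10_000):
    """Stream input_path and write consecutive files of at most
    lines_per_chunk lines each into out_dir.

    Assumption: one document per line, no header row, and no quoted
    field that spans multiple lines.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(input_path, encoding="utf-8") as src:
        for index in itertools.count():
            # islice pulls the next batch of lines without loading
            # the whole file into memory.
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            # Hypothetical naming scheme; zero-padded so files sort in order.
            chunk_path = out_dir / f"chunk_{index:04d}.csv"
            chunk_path.write_text("".join(lines), encoding="utf-8")
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be classified the same way a small file would be, and the per-chunk predictions concatenated in chunk order. Whether a given chunk size keeps a run under the 24-hour limit would need to be checked empirically.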

Should you decide to update the script yourself, a pull request would be very much appreciated! ;-)

HTH,

[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii

bateman commented 5 years ago

See issue #7