Etwas-Builders / Twitter-Source-Bot

Ever wanted to know the source of a tweet? Just @whosaidthis_bot and I'll tell you where it came from
https://twittersourcebot.tech/

Processing Large Files #50

Open CryogenicPlanet opened 4 years ago

CryogenicPlanet commented 4 years ago

From @rithvikmahin #24

Cause of issue: The sources found include PDF documents and academic papers that are very long, with over 1.5 million characters. spaCy takes too long (over 30 seconds) to run nlp(text) and create a document object from the text, which stalls the entire processing system.

Temporary solution: Created a timer that stops processing that document if it takes longer than 30 seconds and moves on to the next one.

Potential solution / TODO: Add a queue for all tweets that take longer than 30 seconds to process, and return a "Will provide the source later" statement to the user. Once the tweets are processed, return them to the user at any point in time later.
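The 30-second timer described above can be sketched as a promise wrapper in Node-style TypeScript. This is a minimal illustration, not the bot's actual code; `withTimeout` is a hypothetical helper name.

```typescript
// Hypothetical sketch: race a long-running task (e.g. document processing)
// against a timer, rejecting so the caller can move on to the next source.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}
```

The caller would catch the timeout rejection and skip to the next source instead of stalling.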

CryogenicPlanet commented 4 years ago

I had a potential solution to this and wanted your opinion on it, @lunaroyster. Right now we have a 30-second timeout, after which we say we cannot cite the source. I was thinking we could make a special case for these large files: add them to some sort of background queue and process them when there are no other active tweets, so they can take as much time as they need.

So this would involve a third promise state. Right now either one of the sources resolves or none of them do, but we would need a third state where a source is classified as a large document. At that point we could reply to the user saying we couldn't do it now but we'll keep trying, or something along those lines.

Question 1: How would we implement this third promise state?
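One way the third state could look, as a sketch (the type and function names here are illustrative assumptions, not existing code): instead of rejecting on timeout, resolve with a tagged result so the caller can distinguish "found" from "deferred to the background queue".

```typescript
// Hypothetical three-way-capable result: either the source resolved in time,
// or it is classified as a large document to be processed later.
type SourceResult<T> =
  | { kind: "found"; value: T }
  | { kind: "deferred" };

// Resolve with the result if the work finishes within `ms`; otherwise
// resolve (not reject) with "deferred", so the bot can reply
// "we'll keep trying" and enqueue the job for background processing.
function resolveOrDefer<T>(work: Promise<T>, ms: number): Promise<SourceResult<T>> {
  const timeout = new Promise<SourceResult<T>>((resolve) =>
    setTimeout(() => resolve({ kind: "deferred" }), ms)
  );
  const done = work.then((value): SourceResult<T> => ({ kind: "found", value }));
  return Promise.race([done, timeout]);
}
```

Because both branches resolve, the existing "one resolves / none resolve" logic would only need a check on `kind` rather than new rejection handling.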

After this I was thinking we make a queue of these large sources and slowly go through them, pausing whenever a new tweet comes in.

Question 2: What's a good way to do this?
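A minimal sketch of such a pausable background queue, assuming a single-process Node bot (class and method names are made up for illustration): large jobs run one at a time, and `pause()` could be called whenever a fresh tweet arrives so it takes priority.

```typescript
// Hypothetical background queue for large documents. Jobs are run
// sequentially; pause() stops the drain loop after the current job,
// and resume() picks it back up when no active tweets remain.
class SlowQueue {
  private jobs: Array<() => Promise<void>> = [];
  private paused = false;
  private running = false;

  enqueue(job: () => Promise<void>): void {
    this.jobs.push(job);
    void this.drain();
  }

  pause(): void { this.paused = true; }       // e.g. when a new tweet comes in
  resume(): void { this.paused = false; void this.drain(); }

  private async drain(): Promise<void> {
    if (this.running) return;                 // only one drain loop at a time
    this.running = true;
    while (!this.paused && this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      await job();                            // finish current job before re-checking pause
    }
    this.running = false;
  }
}
```

One caveat with this shape: a job that is already running isn't interrupted by `pause()`, only the start of the next job is, so very long single documents would still need the timeout guard around each job.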

So @lunaroyster, what do you think of this overall idea? Any better ways to fix the problem? Could we potentially use some decentralized setup to do this?

Could we use those cheap, short-lived repl kind of servers to get more compute power for these and run them there?