keien closed this issue 10 years ago
what command gives you this error?
run_pipeline on r_and_j
Do you know if CoreNLP has a max sentence length or something? Or maybe it's unhappy with the newline characters at the end?
I don't think it has a max sentence length... I'm going to take a look at this right now.
hang on I think I need to push my branch up first
Never mind, it's fine; the branch is removing-readerwriter, by the way.
There are also a lot of unit tests failing, some of which don't make sense to me because the branch has already been merged from master.
Oh, heh, I see what's happening here.
The original code had a limitation in place for length of a sentence, probably to conserve resources.
The sentence it chokes on is this:
you men, you beasts, That quench the fire of your pernicious rage With purple fountains issuing from your veins, On pain of torture, from those bloody hands Throw your mistemper'd weapons to the ground, And hear the sentence of your moved prince.
Found on line 388.
Why would the unit tests be passing? They were failing when you branched from handling-duplicates. For the most part the fixes shouldn't be too bad.
We could increase the limit or remove it entirely and hope that nobody feeds in a sentence so long that it breaks something.
Most of them are actually just these:
DetachedInstanceError: Instance <User at 0x7f0a1b601710> is not bound to a Session; attribute refresh operation cannot proceed
I don't even know what this means.
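For context, that error comes from SQLAlchemy: an object was loaded in one session, the session was closed, and then something tried to touch one of its (now expired) attributes. A minimal reproduction, assuming a hypothetical `User` model like ours:

```python
# Minimal reproduction of DetachedInstanceError with a hypothetical User
# model (not the project's real one), using an in-memory SQLite database.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
user = User(name="alice")
session.add(user)
session.commit()  # attributes are expired after commit by default
session.close()   # instance is now detached from any session

try:
    _ = user.name  # refreshing an expired attribute needs a session
    refreshed = True
except DetachedInstanceError:
    refreshed = False
```

So in the tests, it usually means a fixture's session got torn down before the test finished using the object; keeping the session open (or setting `expire_on_commit=False`) avoids it.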
We should remove the limit but add a memory check or something so that if a sentence is too long and can't be processed, we throw an error and let the user know.
This would be much easier if I understood how mock works... do you have something I can refer to to figure it out?
Those crop up when another unit test is breaking; it's a sort of domino effect. It's best to ignore them and fix the other problems, and they'll fix themselves.
Well, the limit is pretty much a memory check; perhaps instead of failing the whole pipeline it could just skip to the next sentence.
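Something like this, maybe. This is just a sketch; `process_sentence` and `MAX_SENTENCE_LENGTH` are made-up names standing in for the real CoreNLP call and whatever cutoff we settle on:

```python
# Sketch: instead of aborting the whole pipeline on an over-long sentence,
# log a warning and move on. process_sentence and MAX_SENTENCE_LENGTH are
# hypothetical stand-ins, not the project's actual API.
import logging

logger = logging.getLogger(__name__)
MAX_SENTENCE_LENGTH = 500  # characters; arbitrary illustrative cutoff

def process_sentence(sentence):
    # stand-in for the real CoreNLP call
    return sentence.split()

def process_all(sentences):
    results = []
    for i, sentence in enumerate(sentences):
        if len(sentence) > MAX_SENTENCE_LENGTH:
            logger.warning("Skipping sentence %d: too long (%d chars)",
                           i, len(sentence))
            continue
        results.append(process_sentence(sentence))
    return results
```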
Here's some documentation for mock.
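The short version is that you patch out the expensive or external call and assert against a canned return value. A tiny example with the stdlib `unittest.mock` (the `StringProcessor` here is a stand-in class, not our real one):

```python
# Tiny unittest.mock example: patch a method so the test never hits the
# real (slow/external) implementation. StringProcessor is a stand-in.
from unittest import mock

class StringProcessor:
    def parse(self, text):
        raise RuntimeError("would call CoreNLP")

def count_words(processor, text):
    return len(processor.parse(text))

# Inside the with-block, parse always returns the canned list.
with mock.patch.object(StringProcessor, "parse", return_value=["a", "b"]):
    processor = StringProcessor()
    result = count_words(processor, "ignored")
```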
The problem is that because we intend this to be run locally, we can't predict everyone's memory limits. This is also something the duplication handler can potentially have an issue with; the dictionary used for look-ups refreshes every 50 sentences, but it could break if there's not enough memory to store all of that.
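For reference, the refresh-every-50-sentences behavior described above amounts to something like this rough sketch (names and structure are illustrative, not the duplication handler's real code):

```python
# Rough sketch of a look-up dictionary that is flushed every
# REFRESH_INTERVAL sentences to bound its memory use, at the cost of
# missing duplicates that span a flush boundary. Illustrative only.
REFRESH_INTERVAL = 50

def deduplicate(sentences):
    seen = {}
    unique = []
    for i, sentence in enumerate(sentences):
        if i and i % REFRESH_INTERVAL == 0:
            seen.clear()  # flush the look-up dictionary
        if sentence not in seen:
            seen[sentence] = True
            unique.append(sentence)
    return unique
```

Even with the flush, a single 50-sentence window could still exceed memory on a constrained machine, which is the concern above.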
Also, maybe we should update the tests (and method names) because they don't all do the same thing as when the tests were written. For example, the return value of StringProcessor.parse doesn't really matter because we write everything to the database on the fly now; same with StringProcessor.tokenize_from_raw, though for that one it's probably convenient to return the list of sentences.
How would your solution address not knowing everyone's memory limits?
I think you're right that parse no longer needs to return anything; however, tokenize_from_raw definitely does need to return things - it and tokenize are used in lots of places that need to get sentences and process them rather than writing to the database straight away.
I'm not sure, really. I'd have to look into how Python handles memory issues. Is there a temporary solution we can put in place for now?
Yes, for now let's just remove the check since we won't be handling any unreasonable data. I'll do that and make an issue to make a better check.
kk sounds good
I'll go ahead and close this issue.
"Fix" has been pushed to removing-readerwriter.
I just merged the sentence combining into my branch and I'm getting this error: