KeremZaman / semantic-sh

semantic-sh is a SimHash implementation that detects and groups similar texts by leveraging word vectors and transformer-based language models (BERT).
MIT License

get hash endpoint error #5

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hi,

Hope you are all well!

I tried to get the hash of an abstract and it triggers the following error:

semantic-sh_1                 |  * Tip: There are .env or .flaskenv files present. Do "pip install python-dotenv" to use them.
semantic-sh_1                 |  * Serving Flask app "server" (lazy loading)
semantic-sh_1                 |  * Environment: production
semantic-sh_1                 |    WARNING: This is a development server. Do not use it in a production deployment.
semantic-sh_1                 |    Use a production WSGI server instead.
semantic-sh_1                 |  * Debug mode: off
semantic-sh_1                 | /opt/service/semantic_sh/semantic_sh.py:51: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
semantic-sh_1                 |   return np.vstack((np.random.normal(0, 1, dim) for i in range(0, key_size)))
semantic-sh_1                 |  * Running on http://0.0.0.0:5001/ (Press CTRL+C to quit)
semantic-sh_1                 | Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
semantic-sh_1                 | [2020-08-09 06:48:59,687] ERROR in app: Exception on /api/hash [GET]
semantic-sh_1                 | Traceback (most recent call last):
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
semantic-sh_1                 |     response = self.full_dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1952, in full_dispatch_request
semantic-sh_1                 |     rv = self.handle_user_exception(e)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1821, in handle_user_exception
semantic-sh_1                 |     reraise(exc_type, exc_value, tb)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
semantic-sh_1                 |     raise value
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1950, in full_dispatch_request
semantic-sh_1                 |     rv = self.dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1936, in dispatch_request
semantic-sh_1                 |     return self.view_functions[rule.endpoint](**req.view_args)
semantic-sh_1                 |   File "./server.py", line 20, in generate_hash
semantic-sh_1                 |     return hex(sh.get_hash(txt))
semantic-sh_1                 |   File "/opt/service/semantic_sh/semantic_sh.py", line 88, in get_hash
semantic-sh_1                 |     y = np.matmul(self._proj, enc)
semantic-sh_1                 | ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 768 is different from 300)
semantic-sh_1                 | 51.210.37.251 - - [09/Aug/2020 06:48:59] "GET /api/hash?text=Recent+work+has+demonstrated+substantial+gains+on+many+NLP+tasks+and+benchmarks+by+pre-training+on+a+large+corpus+of+text+followed+by+fine-tuning+on+a+specific+task.+While+typically+task-agnostic+in+architecture%2C+this+method+still+requires+task-specific+fine-tuning+datasets+of+thousands+or+tens+of+thousands+of+examples.+By+contrast%2C+humans+can+generally+perform+a+new+language+task+from+only+a+few+examples+or+from+simple+instructions+-+something+which+current+NLP+systems+still+largely+struggle+to+do.+Here+we+show+that+scaling+up+language+models+greatly+improves+task-agnostic%2C+few-shot+performance%2C+sometimes+even+reaching+competitiveness+with+prior+state-of-the-art+fine-tuning+approaches.+Specifically%2C+we+train+GPT-3%2C+an+autoregressive+language+model+with+175+billion+parameters%2C+10x+more+than+any+previous+non-sparse+language+model%2C+and+test+its+performance+in+the+few-shot+setting.+For+all+tasks%2C+GPT-3+is+applied+without+any+gradient+updates+or+fine-tuning%2C+with+tasks+and+few-shot+demonstrations+specified+purely+via+text+interaction+with+the+model.+GPT-3+achieves+strong+performance+on+many+NLP+datasets%2C+including+translation%2C+question-answering%2C+and+cloze+tasks%2C+as+well+as+several+tasks+that+require+on-the-fly+reasoning+or+domain+adaptation%2C+such+as+unscrambling+words%2C+using+a+novel+word+in+a+sentence%2C+or+performing+3-digit+arithmetic.+At+the+same+time%2C+we+also+identify+some+datasets+where+GPT-3%27s+few-shot+learning+still+struggles%2C+as+well+as+some+datasets+where+GPT-3+faces+methodological+issues+related+to+training+on+large+web+corpora.+Finally%2C+we+find+that+GPT-3+can+generate+samples+of+news+articles+which+human+evaluators+have+difficulty+distinguishing+from+articles+written+by+humans.+We+discuss+broader+societal+impacts+of+this+finding+and+of+GPT-3+in+general. HTTP/1.1" 500 -

Any idea how to sort it out? Is it related to the server configuration?

Cheers, X

ghost commented 3 years ago

The same happens when trying to add a text.

semantic-sh_1                 | Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
semantic-sh_1                 | [2020-08-09 06:53:19,438] ERROR in app: Exception on /api/add [GET]
semantic-sh_1                 | Traceback (most recent call last):
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
semantic-sh_1                 |     response = self.full_dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1952, in full_dispatch_request
semantic-sh_1                 |     rv = self.handle_user_exception(e)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1821, in handle_user_exception
semantic-sh_1                 |     reraise(exc_type, exc_value, tb)
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
semantic-sh_1                 |     raise value
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1950, in full_dispatch_request
semantic-sh_1                 |     rv = self.dispatch_request()
semantic-sh_1                 |   File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1936, in dispatch_request
semantic-sh_1                 |     return self.view_functions[rule.endpoint](**req.view_args)
semantic-sh_1                 |   File "./server.py", line 26, in add
semantic-sh_1                 |     sh.add_document(txt)
semantic-sh_1                 |   File "/opt/service/semantic_sh/semantic_sh.py", line 96, in add_document
semantic-sh_1                 |     h = self.get_hash(txt)
semantic-sh_1                 |   File "/opt/service/semantic_sh/semantic_sh.py", line 88, in get_hash
semantic-sh_1                 |     y = np.matmul(self._proj, enc)
semantic-sh_1                 | ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 768 is different from 300)
semantic-sh_1                 | 51.210.37.251 - - [09/Aug/2020 06:53:19] "GET /api/add?text=Recent+work+has+demonstrated+substantial+gains+on+many+NLP+tasks+and+benchmarks+by+pre-training+on+a+large+corpus+of+text+followed+by+fine-tuning+on+a+specific+task.+While+typically+task-agnostic+in+architecture%2C+this+method+still+requires+task-specific+fine-tuning+datasets+of+thousands+or+tens+of+thousands+of+examples.+By+contrast%2C+humans+can+generally+perform+a+new+language+task+from+only+a+few+examples+or+from+simple+instructions+-+something+which+current+NLP+systems+still+largely+struggle+to+do.+Here+we+show+that+scaling+up+language+models+greatly+improves+task-agnostic%2C+few-shot+performance%2C+sometimes+even+reaching+competitiveness+with+prior+state-of-the-art+fine-tuning+approaches.+Specifically%2C+we+train+GPT-3%2C+an+autoregressive+language+model+with+175+billion+parameters%2C+10x+more+than+any+previous+non-sparse+language+model%2C+and+test+its+performance+in+the+few-shot+setting.+For+all+tasks%2C+GPT-3+is+applied+without+any+gradient+updates+or+fine-tuning%2C+with+tasks+and+few-shot+demonstrations+specified+purely+via+text+interaction+with+the+model.+GPT-3+achieves+strong+performance+on+many+NLP+datasets%2C+including+translation%2C+question-answering%2C+and+cloze+tasks%2C+as+well+as+several+tasks+that+require+on-the-fly+reasoning+or+domain+adaptation%2C+such+as+unscrambling+words%2C+using+a+novel+word+in+a+sentence%2C+or+performing+3-digit+arithmetic.+At+the+same+time%2C+we+also+identify+some+datasets+where+GPT-3%27s+few-shot+learning+still+struggles%2C+as+well+as+some+datasets+where+GPT-3+faces+methodological+issues+related+to+training+on+large+web+corpora.+Finally%2C+we+find+that+GPT-3+can+generate+samples+of+news+articles+which+human+evaluators+have+difficulty+distinguishing+from+articles+written+by+humans.+We+discuss+broader+societal+impacts+of+this+finding+and+of+GPT-3+in+general. HTTP/1.1" 500 -

Also, wouldn't it be better to use the POST method for these two endpoints?

KeremZaman commented 3 years ago

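It seems you are using BERT as the model but have not changed the default `dim` parameter from 300 to 768. The projection matrix is built with `dim` columns, so it cannot be multiplied with a 768-dimensional BERT encoding, which is what the `matmul` error reports.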
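To illustrate the mismatch, here is a minimal standalone sketch and not the library's actual code. The helper names `make_projection` and `simhash_bits`, the `key_size=16` value, and the all-ones stand-in for a BERT encoding are assumptions made for the example; only `key_size`, `dim`, and the `np.matmul(proj, enc)` call mirror what appears in the traceback.

```python
import numpy as np


def make_projection(key_size: int, dim: int, seed: int = 0) -> np.ndarray:
    """Random hyperplanes, one row per hash bit (shape: key_size x dim)."""
    rng = np.random.default_rng(seed)
    # A list (not a generator) is passed to vstack, which also avoids
    # the FutureWarning shown at the top of the log.
    return np.vstack([rng.normal(0, 1, dim) for _ in range(key_size)])


def simhash_bits(proj: np.ndarray, enc: np.ndarray) -> int:
    """Project the encoding and turn the sign of each component into a bit."""
    if proj.shape[1] != enc.shape[0]:
        raise ValueError(
            f"projection expects dim={proj.shape[1]}, "
            f"but the encoding has dim={enc.shape[0]}"
        )
    y = np.matmul(proj, enc)
    bits = (y > 0).astype(int)
    return int("".join(map(str, bits)), 2)


# Hypothetical stand-in for a 768-dimensional BERT sentence vector.
bert_encoding = np.ones(768)

# dim=300 (the default mentioned above) does not fit a 768-d encoding.
try:
    simhash_bits(make_projection(key_size=16, dim=300), bert_encoding)
except ValueError as err:
    print("mismatch:", err)

# Setting dim to the encoder's output size resolves the error.
h = simhash_bits(make_projection(key_size=16, dim=768), bert_encoding)
print(hex(h))
```

The point of the sketch is only that the second dimension of the projection matrix has to equal the length of the encoded vector, whichever model produces it.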

Also, you're right about using the POST method; it would definitely be better, since long texts would no longer have to fit in the URL.
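For reference, a client-side sketch of what such a POST call could look like, using only the standard library. It assumes a server that accepts POST on `/api/hash` with a form-encoded `text` field, which is not how the endpoints behave in the logs above (they are GET); the `build_hash_request` helper and the `localhost` base URL are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_hash_request(text: str, base_url: str = "http://localhost:5001") -> Request:
    """Build (but do not send) a POST request carrying the text in the body."""
    # Assumption: the server reads a form field named "text" from the body.
    body = urlencode({"text": text}).encode("utf-8")
    return Request(
        f"{base_url}/api/hash",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_hash_request(
    "Recent work has demonstrated substantial gains on many NLP tasks."
)
print(req.get_method(), req.full_url, len(req.data), "bytes in body")
# To actually send it: urllib.request.urlopen(req)
```

With the text in the request body, the abstract no longer ends up in the URL or in the access log line.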

ghost commented 3 years ago

Works, thanks