KeremZaman / semantic-sh

semantic-sh is a SimHash implementation for detecting and grouping similar texts by leveraging word vectors and transformer-based language models (BERT).
MIT License
23 stars · 3 forks
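For context, a minimal sketch of the SimHash idea the library builds on: project each text's embedding onto random hyperplanes and keep only the sign bits, so similar embeddings tend to share most bits. The bit width and embedding dimension below are placeholders, not the project's actual defaults.

```python
import numpy as np

def simhash(embedding: np.ndarray, hyperplanes: np.ndarray) -> int:
    """Return a SimHash fingerprint: one bit per random hyperplane,
    set to 1 when the embedding falls on the positive side."""
    bits = (hyperplanes @ embedding) > 0
    fingerprint = 0
    for bit in bits:
        fingerprint = (fingerprint << 1) | int(bit)
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance suggests similar texts."""
    return bin(a ^ b).count("1")

# Placeholder setup: 128-bit hashes over 768-dimensional (BERT-sized) embeddings.
rng = np.random.default_rng(0)
planes = rng.standard_normal((128, 768))

emb_a = rng.standard_normal(768)
emb_b = emb_a + 0.01 * rng.standard_normal(768)  # a near-duplicate embedding

print(hamming_distance(simhash(emb_a, planes), simhash(emb_b, planes)))
```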

semantic-sh crashes after adding some documents #6

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hi,

Hope you are all well!

I tried to push 300k abstracts into semantic-sh, but at some point the server crashes without any debug information.
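Roughly, I push the documents one at a time through the HTTP API, along these lines (a minimal sketch; the host/port and file name below are placeholders for my actual setup):

```python
import requests

# Placeholder values: the container exposes the API at this address
# and the abstracts are stored one per line in this file.
BASE_URL = "http://localhost:5000"
DUMP_FILE = "abstracts.txt"

with open(DUMP_FILE, encoding="utf-8") as f:
    for i, line in enumerate(f):
        text = line.strip()
        if not text:
            continue
        # Same /api/add endpoint that shows up in the docker logs below.
        r = requests.get(f"{BASE_URL}/api/add", params={"text": text})
        r.raise_for_status()
        if i % 1000 == 0:
            print(f"added {i} documents")
```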

Is there a way to enable some debug information? Do you want me to prepare a dump of my dataset for you?

Thanks in advance for any replies or insights about that.

Cheers, X

KeremZaman commented 3 years ago

Hi, I need more information to understand the problem. What kind of hardware are you using? Could you be running out of memory? Does it crash after a specific number of documents, or is it random? Does it only happen when you run it in Docker? Is there any error message in the Docker logs? Are you using semantic-sh together with something else? Details like these will make it easier to help.
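One quick way to check the memory question: watch the server process's resident memory from the host while documents are being added. A minimal sketch using psutil (the process name and polling interval here are assumptions):

```python
import time
from typing import Optional

import psutil

# Assumption: the server runs as a "python3" process visible from the host;
# adjust the name or attach to the PID directly if that is not the case.
PROCESS_NAME = "python3"

def find_server() -> Optional[psutil.Process]:
    """Return the first process whose name matches PROCESS_NAME, if any."""
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == PROCESS_NAME:
            return proc
    return None

proc = find_server()
while proc is not None and proc.is_running():
    rss_mb = proc.memory_info().rss / (1024 ** 2)
    print(f"resident memory: {rss_mb:.0f} MiB")
    time.sleep(10)
```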

ghost commented 3 years ago

Hi,

I am using the Docker GPU image. Here is a dump of the list of texts I am trying to import: https://paper2code.com/public/semantic-sh_dump.txt.tar.gz

Docker logs:

semantic-sh    | 51.210.37.251 - - [10/Aug/2020 05:41:52] "GET /api/add?text=We+present+the+results+of+determination+of+the+age%2C+helium+mass+fraction+%28Y%29%2C+metallicity+%28%5BFe%2FH%5D%29%2C+and+abundances+of+the+elements+C%2C+N%2C+O%2C+Na%2C+Mg%2C+Ca%2C+Ti%2C+C+and+Mn+for+the+Galactic+globular+cluster+NGC+6652.+We+use+its+medium-resolution+integrated-light+spectrum+from+the+library+of+Schiavon+and+our+population+synthesis+method+to+fulfill+this+task.+We+select+the+evolutionary+isochrone+and+stellar+mass+function+for+our+analysis%2C+which+provide+the+best+approximation+to+the+shapes+and+intensities+of+the+observed+Balmer+line+profiles.+The+determined+elemental+abundances%2C+age+and+metallicity+are+characteristic+of+stellar+populations+in+the+Galactic+Bulge. HTTP/1.1" 200 -
semantic-sh    | Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
semantic-sh exited with code 137

Here is my config.

MemAvailable: 109164428 kB

CPUs:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:            1
CPU MHz:             2394.553
BogoMIPS:            4789.10
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat umip md_clear

GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:00:05.0 Off |                  N/A |
| 23%   25C    P8     8W / 250W |   2681MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:00:07.0 Off |                  N/A |
| 23%   25C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:00:09.0 Off |                  N/A |
| 23%   27C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2346      C   python3                                     2671MiB |
+-----------------------------------------------------------------------------+

Cheers, X

KeremZaman commented 3 years ago

After we discussed this in detail, the issue couldn't be reproduced. There has been no update since then, so I'm closing the issue.