Closed · dcmouth closed this 2 years ago
I found the reason: when I run install_external_tools.sh, InstallSentencePiece() errors out while updating to the latest version. How can I install it correctly?
Hi @dcmouth, can you provide more details on the error you received when you ran ./install_external_tools.sh? Is there an error log?
Hi @heffernankevin,
I met the same error when I ran embed.sh:
2022-07-20 14:10:19,484 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb/laser2.spm
2022-07-20 14:10:19,484 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-20 14:10:19,484 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb/laser2.pt
2022-07-20 14:10:19,926 | INFO | preprocess | SPM processing doc.zh.txt
2022-07-20 14:10:19,991 | ERROR | preprocess | /bin/bash: /Users/liguangyu/LASER/tools-external/sentencepiece-master/build/src//spm_encode: No such file or directory
Exception ignored in: <_io.TextIOWrapper name='
Hi @guangyuli-uoe, was there an error when you installed the external tools? If so, can you comment it here? Also it appears that the segmentation fault issue is no longer occurring? (since from the logs it appears you're able to load the encoder successfully).
It seems there are only some warnings.
By the way, the segmentation fault was solved after updating PyTorch! Really, thanks ^^
@guangyuli-uoe great to hear the segmentation fault issue has been resolved! In your error logs, there is the following error:
cmake: command not found
when attempting to build the binaries. Can you install cmake and try again? (e.g., if on macOS, brew install cmake)
@dcmouth perhaps this is the same issue you faced?
@heffernankevin
Really, thanks for your kind reminder!
I can run the script on my laptop now! ^^
Could I run the code on a GPU directly?
@guangyuli-uoe that's great! Yes, you should be able to run the embedding generation on the GPU (if CUDA is available).
hi, @heffernankevin
there is a cmake error when running install_external_tools.sh on the GPU machine, which is quite weird...
finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt
- building code
-- VERSION: 0.1.97
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found TCMalloc: /usr/lib64/libtcmalloc_minimal.so
-- Configuring done
CMake Error in src/CMakeLists.txt:
  Target "sentencepiece" requires the language dialect "CXX17" (with
  compiler extensions), but CMake does not know the compile flags to use
  to enable it.
Hi @guangyuli-uoe, there shouldn't be a need to recompile the external tools on GPU. You can keep your existing binaries.
Hi @heffernankevin
log.txt
Can I trouble you to look at where the problem is?
Hi @dcmouth, thanks for providing the log file. It looks like you might need to update your compiler to gcc 7, as it's having trouble finding string_view. Can you try upgrading and then re-running the script?
Also I noticed you previously had an issue downloading moses scripts (https://github.com/facebookresearch/LASER/issues/171) . Has this been resolved?
hi @heffernankevin
I want to align sentences in Chinese with sentences in English.
First I embed them into two files with embed.py,
then I think I should compute the similarities at the sentence level.
Following the instructions for xsim, I noticed this strategy; here, are "A" and "B" the paths to the two files?

fp16_flag = False      # set true if embeddings are saved in 16 bit
embedding_dim = 1024   # set dimension of saved embeddings
err, nbex = xSIM(x, y, dim=embedding_dim, fp16=fp16_flag)
Hi @guangyuli-uoe, yes that would be the correct way to calculate xsim for your embedding files.
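For reference, xSIM expects the embedding arrays rather than bare paths, so the files written by embed.py need to be read in first. Below is a minimal loading sketch, assuming LASER-style embedding files (raw float32, or float16 when fp16 is used, with no header); the function name and file names are hypothetical:

```python
import numpy as np
import os
import tempfile

def load_embeddings(path, dim, fp16=False):
    """Read a raw, headerless embedding file into an (n_sentences, dim) array."""
    dtype = np.float16 if fp16 else np.float32
    return np.fromfile(path, dtype=dtype).reshape(-1, dim)

# Demo with a synthetic file: 3 "sentence" vectors of dimension 4
# (use dim=1024 for real laser2 embeddings).
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    np.arange(12, dtype=np.float32).tofile(f)
embs = load_embeddings(f.name, dim=4)
os.unlink(f.name)
print(embs.shape)  # (3, 4)
```

The two arrays loaded this way would then be passed to xSIM as x and y.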
hi, @heffernankevin
when I ran the bucc task on the GPU machine,
a new error occurred TT
Hi @guangyuli-uoe, thanks for providing the log. For some reason you don't have permission to run the moses scripts. How did you download these? Can you try the following:
ls -lh LASER/tools-external/moses-tokenizer/tokenizer/*.perl
(what is the output?)
You could also try to edit the permissions, e.g. chmod 775, to give yourself executable access, but this shouldn't be required.
hi, @heffernankevin
when I run the command, the output is as follows:
ls: cannot access 'LASER/tools-external/moses-tokenizer/tokenizer/*.perl': No such file or directory
hi, @heffernankevin
when I tried to install the external tools yesterday,
it had a problem: CMake Error in src/CMakeLists.txt:
(and you told me I could use the existing one, so I just copied the directory)
inflating: sentencepiece-master/third_party/protobuf-lite/zero_copy_stream_impl_lite.cc
finishing deferred symbolic links:
sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt
Hi @guangyuli-uoe, thanks for providing the clarification. As your GPU devices are located on another machine, in this case you will need to re-run the external tools script there as well to rebuild the binaries etc. Looking at your error, it seems like you may need to upgrade CMake. Perhaps this thread might help!
hi, @heffernankevin
it seems that my cmake version is the latest,
and now I am trying to update my g++ (g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)) and gcc (gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)), because their versions are quite old,
but I do not have sudo authority TT
Do you have any ideas or suggestions about that?
@guangyuli-uoe I think you might need to reach out to your system administrator to solve this. In the meantime, does the task run in the CPU setting? (i.e. on your laptop?)
hi, @heffernankevin
sorry for my late reply; I am still trying to upgrade gcc.
I think I can run the task on my laptop.
================= here is my question about xsim
For text a in Chinese and text b in English,
I embed them into a_emd and b_emd,
then I call the function:

fp16_flag = False      # set true if embeddings are saved in 16 bit
embedding_dim = 1024   # set dimension of saved embeddings
err, nbex = xSIM(a_emd, b_emd, dim=embedding_dim, fp16=fp16_flag, margin='absolute')

I know that err is the number of error sentences and nbex is the total number of sentences (if my understanding is right ==), so I can calculate the error rate.
But how do I get the alignment index for the sentences?
Hi @guangyuli-uoe, you should be able to use $LASER/source/mine_bitexts.py to do this. For example:

python mine_bitexts.py file_A file_B \
    --src-lang lang_A --trg-lang lang_B \
    --output alignments.tsv \
    --mode mine --verbose \
    --src-embeddings embeddings_A \
    --trg-embeddings embeddings_B
hi, @heffernankevin
I updated gcc and g++ to 9.4 on the GPU machine, which was fine for installing the external tools (I tested it on CPU):
finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt
building code
-- VERSION: 0.1.97
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
-- Configuring done
-- Generating done
-- Build files have been written to: /home/LASER/tools-external/sentencepiece-master/build
Scanning dependencies of target sentencepiece_train-static
Scanning dependencies of target sentencepiece-static
Scanning dependencies of target sentencepiece
[ 1%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unicode_script.cc.o
......
but it still fails:
finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt
building code
-- VERSION: 0.1.97
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found TCMalloc: /usr/lib64/libtcmalloc_minimal.so
-- Configuring done
CMake Error in src/CMakeLists.txt:
Target "sentencepiece_train-static" requires the language dialect "CXX17"
(with compiler extensions), but CMake does not know the compile flags to
use to enable it.
And here are my current versions (on the GPU machine), but as you can see above, it says the GNU version is 4.8.5:
gcc (GCC) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Hi @guangyuli-uoe, it looks like the second example above (where the error occurs) may have been run on a different machine since it has an older version of gcc installed. Is this the case? If so, are you able to upgrade it there as well?
hi, @heffernankevin
I have double-checked the version, and I think it is the new version (on the GPU machine),
but I do not know why the script detects the old version.
I just downloaded the new version, ran make install, and sourced it, but did not remove the old one.
Intuitively, shouldn't the old version be replaced by the new one?
hi, @heffernankevin
Here are the locations:
(base) [uhtred]username: which gcc
/home/username/gcc-9.4/bin/gcc
(base) [uhtred]username: which make
/usr/bin/make
But I noticed that the locations of make and gcc should be in the same directory? Here are results from online:
devops@devops-osetc:~$ which gcc
/usr/bin/gcc
devops@devops-osetc:~$ which g++
/usr/bin/g++
devops@devops-osetc:~$ which make
/usr/bin/make
Thanks a lot for your attention. The issue above (#171) has been resolved. I upgraded to gcc 7 and g++ 7, and this issue got resolved.
great!!
- Target "sentencepiece" requires the language dialect "CXX17"
Hi @guangyuli-uoe, from what I can see, it still looks like the machine you're running the second build attempt as mentioned here is using an outdated version of the compilers. To resolve this issue, I think the best solution might be to contact whoever manages the "gpu" machine which is using the outdated compilers, as they would have all the best information to help solve this issue!
hi, @heffernankevin
when I run bitext mining, e.g.
zh-doc: 3 sentences, en-sum: 2 sentences,
it seems that the provided script calculates the 2*3 similarities, finds the max score, and outputs only the alignment with the max score to the tsv file.
But how can I get all the results instead of only the max one?
best wishes,
Hi @guangyuli-uoe, the bitext mining script will output all alignments which score above a certain threshold (default is 0), so not all sentences may be aligned. Can you try this with a larger number of en-zh sentences? (it should output many possible alignments).
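The thresholding behavior described here can be illustrated with a toy sketch (the candidate tuples and values are made up; the real script scores margin-based similarities):

```python
# Candidate pairs as (score, src_sentence, trg_sentence); only pairs
# scoring above the threshold (default 0) are kept, so low-scoring
# sentences end up without an alignment in the output.
candidates = [(0.7, "a", "1"), (-0.1, "b", "2"), (0.6, "c", "2")]
threshold = 0.0
alignments = [c for c in candidates if c[0] > threshold]
print(alignments)  # [(0.7, 'a', '1'), (0.6, 'c', '2')]
```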
hi @heffernankevin
I am sorry, I think I did not describe my question clearly.
E.g.:
zh: 3 sentences (a, b, c)
en: 2 sentences (1, 2)
What the script does now: calculate score(a,1), score(a,2), score(b,1), score(b,2), score(c,1), score(c,2), and get the alignment with the max score for each target sentence (1 or 2). The output to the tsv may look like:
0.7 a 1
0.6 c 2
But here I cannot know what the other scores look like,
TT
Apologies if my understanding is wrong.
hi @heffernankevin
I am currently trying to modify the script, and it seems I can get more results.
But could you please explain these two scores: fwd_scores and bwd_scores?
If zh has 3 sentences and en has 2 sentences,
then I found the shapes are fwd_scores: (3, 2) and bwd_scores: (2, 3),
and then the script does:
scores = np.concatenate((fwd_scores.max(axis=1), bwd_scores.max(axis=1)))
Here I do not know why it concatenates.
best wishes,
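For reference, the shapes and the concatenation asked about above can be sketched with toy matrices (the values are made up, and a plain transpose stands in for the separately margin-scored backward matrix in mine_bitexts.py):

```python
import numpy as np

# Toy similarity matrices for 3 zh and 2 en sentences.
fwd_scores = np.array([[0.7, 0.2],
                       [0.1, 0.6],
                       [0.4, 0.3]])   # shape (3, 2): each zh vs all en
bwd_scores = fwd_scores.T             # shape (2, 3): each en vs all zh

# Best candidate for each sentence, per direction.
fwd_best = fwd_scores.argmax(axis=1)  # en index chosen for each zh sentence
bwd_best = bwd_scores.argmax(axis=1)  # zh index chosen for each en sentence

# Concatenating pools the best scores from both directions into one
# list, so forward and backward candidate pairs can be ranked together.
scores = np.concatenate((fwd_scores.max(axis=1), bwd_scores.max(axis=1)))
print(scores)  # [0.7 0.6 0.4 0.7 0.6]
```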
Hi @guangyuli-uoe, if you would like scores between particular pairs like in your example above, instead of modifying the script perhaps one option here could be to try preprocessing your files to match pairs you would like to explicitly score. For example if you had the following two files:
zh | en |
---|---|
A | 1 |
B | 2 |
C |
You could then try creating the following files:
zh | en |
---|---|
A | 1 |
A | 2 |
B | 1 |
B | 2 |
C | 1 |
C | 2 |
Then embed the new zh and en files and, when running mine_bitexts.py, use the option --mode score. This will then give you all the scores between each pair. I hope this helps!
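The pair-file construction suggested above can be sketched like this (the sentence lists are hypothetical in-memory stand-ins for the zh and en text files):

```python
from itertools import product

zh_sents = ["A", "B", "C"]
en_sents = ["1", "2"]

# Every zh/en combination, line-aligned: scoring the two resulting files
# with --mode score then yields a score for each explicit pair.
pairs = list(product(zh_sents, en_sents))
zh_lines = [z for z, _ in pairs]
en_lines = [e for _, e in pairs]
print(zh_lines)  # ['A', 'A', 'B', 'B', 'C', 'C']
print(en_lines)  # ['1', '2', '1', '2', '1', '2']
```

Each list would then be written out one sentence per line before embedding.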
@heffernankevin
haha good idea ! ^^
i will try both of them !
thanks a lot !
Closing due to inactivity but please reopen if needed.