facebookresearch / LASER

Language-Agnostic SEntence Representations

Running bash ./eval.sh / python eval.py / calling embed_sentences shows an error: spm_encode: No such file or directory #212

Closed dcmouth closed 2 years ago

dcmouth commented 2 years ago

(screenshot attached)

dcmouth commented 2 years ago

(screenshot attached)

I found the reason: when I ran install_external_tools.sh, InstallSentencePiece() failed while fetching the latest version. How can I install it correctly?

heffernankevin commented 2 years ago

Hi @dcmouth, can you provide more details on the error you received when you ran ./install_external_tools.sh? Is there an error log?

guangyuli-uoe commented 2 years ago

hi

@heffernankevin

i met the same error when i run the embed.sh

2022-07-20 14:10:19,484 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb/laser2.spm
2022-07-20 14:10:19,484 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-20 14:10:19,484 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb/laser2.pt
2022-07-20 14:10:19,926 | INFO | preprocess | SPM processing doc.zh.txt
2022-07-20 14:10:19,991 | ERROR | preprocess | /bin/bash: /Users/liguangyu/LASER/tools-external/sentencepiece-master/build/src//spm_encode: No such file or directory
Exception ignored in: <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, was there an error when you installed the external tools? If so, can you comment it here? Also it appears that the segmentation fault issue is no longer occurring? (since from the logs it appears you're able to load the encoder successfully).

guangyuli-uoe commented 2 years ago

ext_tool_log.txt

hi @heffernankevin

i attached the whole log file from installing the external tools here,

guangyuli-uoe commented 2 years ago

it seems there are only some warnings,

by the way, the segmentation fault has been solved after updating pytorch ! really thanks ^^

heffernankevin commented 2 years ago

@guangyuli-uoe great to hear the segmentation fault issue has been resolved! In your error logs, there is the following error: cmake: command not found when attempting to build the binaries. Can you install cmake and try again? (e.g., on macOS: brew install cmake)

@dcmouth perhaps this is the same issue you faced?

guangyuli-uoe commented 2 years ago

@heffernankevin

really thanks for your kind reminder !

i can run the script on my laptop now ! ^^

could i run the code on a gpu directly ?

heffernankevin commented 2 years ago

> really thanks for your kind remind !
>
> i can run the script on my laptop now ! ^^
>
> could i run the codes on gpu directly ?

@guangyuli-uoe that's great! Yes, you should be able to run the embedding generation on the GPU (if CUDA is available).

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

there is a cmake error when running install_external_tools.sh on the gpu machine, which is quite weird...

finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt

heffernankevin commented 2 years ago

> hi, @heffernankevin
>
> there is a cmake error when run install_external_tools.sh on gpu, which is quite weird...
>
> finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt

building code
-- VERSION: 0.1.97
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found TCMalloc: /usr/lib64/libtcmalloc_minimal.so
-- Configuring done
CMake Error in src/CMakeLists.txt:
  Target "sentencepiece" requires the language dialect "CXX17" (with
  compiler extensions), but CMake does not know the compile flags to use to
  enable it.

Hi @guangyuli-uoe, there shouldn't be a need to recompile the external tools on GPU. You can keep your existing binaries.

dcmouth commented 2 years ago

Hi @heffernankevin
log.txt, can I trouble you to see where the problem is?

heffernankevin commented 2 years ago

> Hi @heffernankevin log.txt , Can I trouble you to see where the problem is

Hi @dcmouth, thanks for providing the log file. It looks like you might need to update your compiler to gcc 7 as it's having trouble finding string_view . Can you try upgrading and then re-running the script?

Also I noticed you previously had an issue downloading moses scripts (https://github.com/facebookresearch/LASER/issues/171) . Has this been resolved?

guangyuli-uoe commented 2 years ago

hi @heffernankevin

i want to align sentences in Chinese with sentences in English

here i first embed them into 2 files using embed.py

then i think i should compute the similarities in sentence-level,

so for the instructions in xsim, i noticed this strategy — here are "A" and "B" the paths to the two files ?

B: binary embedding files x and y

fp16_flag = False     # set true if embeddings are saved in 16 bit
embedding_dim = 1024  # set dimension of saved embeddings
err, nbex = xSIM(x, y, dim=embedding_dim, fp16=fp16_flag)

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, yes that would be the correct way to calculate xsim for your embedding files.
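For anyone following along, the loading-and-scoring step can be sketched in plain NumPy. This is a simplified, cosine-only illustration, not LASER's actual xSIM implementation (the real xSIM also supports margin-based scoring); `load_laser_embeddings` is a hypothetical helper name, assuming the flat-binary float32/float16 layout of LASER embedding files:

```python
import numpy as np

def load_laser_embeddings(path, dim=1024, fp16=False):
    # LASER embedding files are flat binary arrays of float32 (or float16)
    # values; reshape to (num_sentences, dim).
    dtype = np.float16 if fp16 else np.float32
    return np.fromfile(path, dtype=dtype).reshape(-1, dim)

def xsim_error_count(x, y):
    # For each source vector, find the nearest target vector by cosine
    # similarity; an error is any row whose nearest neighbour is not the
    # sentence at the same index (i.e. not its parallel counterpart).
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    nearest = (x @ y.T).argmax(axis=1)
    err = int((nearest != np.arange(len(x))).sum())
    return err, len(x)
```

The error rate is then simply `err / nbex` over the returned pair.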

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

when i run bucc task on gpu,

a new error occurred TT (screenshot: MicrosoftTeams-image)

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, thanks for providing the log. For some reason you don't have permission to run the moses scripts. How did you download these? Can you try the following:

ls -lh LASER/tools-external/moses-tokenizer/tokenizer/*.perl (what is the output)

You could also try editing their permissions, e.g. chmod 775, to give yourself executable access, but this shouldn't be required.

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

when i run the command, the output is as follows:
ls: cannot access 'LASER/tools-external/moses-tokenizer/tokenizer/*.perl': No such file or directory

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

when i tried to download the external-tool yesterday,

it had a problem: CMake Error in src/CMakeLists.txt:

(and you told me i could use the existing one, so i just copied the directory)

inflating: sentencepiece-master/third_party/protobuf-lite/zero_copy_stream_impl_lite.cc
finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt

heffernankevin commented 2 years ago

> hi, @heffernankevin
>
> when i run the command, the output is as follows: ls: cannot access 'LASER/tools-external/moses-tokenizer/tokenizer/*.perl': No such file or directory

Hi @guangyuli-uoe, thanks for providing the clarification. As your GPU devices are located on another machine, in this case you will need to re-run the external tools script there as well to rebuild the binaries etc. Looking at your error, it seems like you may need to upgrade CMake. Perhaps this thread might help!

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

it seems that the version of my cmake is the latest,

and now i am trying to update my g++ (g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)) and gcc (gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)), because their versions are quite old,

but i do not have sudo authority, TT

do you have any ideas or suggestions about that ?

heffernankevin commented 2 years ago

> hi, @heffernankevin
>
> it seems that the version of my cmake is the latest,
>
> and now i am trying to update my g++ (g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)) and gcc (gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)), because their versions are quite old,
>
> but i do not have the sudo authority, TT
>
> do you have any idea or suggestions about that ?

@guangyuli-uoe I think you might need to reach out to your system administrator to solve this. In the meantime, does the task run in the CPU setting? (i.e. on your laptop?)

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

sorry for my late reply, i am still trying to upgrade the gcc,

i think i could run the task in my laptop,

================= here is my question about xSIM

for the text a in Chinese, text b in English,

i embed them into a_emd and b_emd,

then i call the function:

fp16_flag = False     # set true if embeddings are saved in 16 bit
embedding_dim = 1024  # set dimension of saved embeddings
err, nbex = xSIM(a_emd, b_emd, dim=embedding_dim, fp16=fp16_flag, margin='absolute')

i know that err is the number of error sentences, and nbex is the total number of sentences (if my understanding is right ==), thus i could calculate the error rate,

but how to get the alignment index for sentences ?

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, you should be able to use $LASER/source/mine_bitexts.py to do this. For example:

python mine_bitexts.py file_A file_B \
    --src-lang lang_A --trg-lang lang_B \
    --output alignments.tsv \
    --mode mine --verbose \
    --src-embeddings embeddings_A \
    --trg-embeddings embeddings_B

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

i updated gcc and g++ to 9.4 on gpu, which is ok for downloading the external tools (i tested it on cpu)

finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt

building code
-- VERSION: 0.1.97
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
-- Configuring done
-- Generating done
-- Build files have been written to: /home/LASER/tools-external/sentencepiece-master/build
Scanning dependencies of target sentencepiece_train-static
Scanning dependencies of target sentencepiece-static
Scanning dependencies of target sentencepiece
[ 1%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unicode_script.cc.o
......

but it still fails:

finishing deferred symbolic links: sentencepiece-master/python/test/botchan.txt -> ../../data/botchan.txt

building code
-- VERSION: 0.1.97
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found TCMalloc: /usr/lib64/libtcmalloc_minimal.so
-- Configuring done
CMake Error in src/CMakeLists.txt:
Target "sentencepiece_train-static" requires the language dialect "CXX17"
(with compiler extensions), but CMake does not know the compile flags to
use to enable it.

and here are my current versions (on gpu), but as you can see above, it says the GNU version is 4.8.5

gcc (GCC) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, it looks like the second example above (where the error occurs) may have been run on a different machine since it has an older version of gcc installed. Is this the case? If so, are you able to upgrade it there as well?

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

i have double-checked the version, and i think it is the new version (on gpu),

but i do not know why the script identifies the old version,

i just downloaded the new version, ran make install, and sourced it, but did not remove the old one,

intuitively the old version would be replaced by the new one ?

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

this is the locations

(base) [uhtred]username: which gcc
/home/username/gcc-9.4/bin/gcc
(base) [uhtred]username: which make
/usr/bin/make

but i noticed that make and gcc should perhaps be in the same directory ? here are some results from online:

devops@devops-osetc:~$ which gcc
/usr/bin/gcc
devops@devops-osetc:~$ which g++
/usr/bin/g++
devops@devops-osetc:~$ which make
/usr/bin/make
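One generic workaround worth noting here (an assumption, not advice given in the thread): CMake resolves the compiler from the CC/CXX environment variables (falling back to PATH) and caches the choice in the build directory, so pointing CC/CXX explicitly at the newer toolchain and clearing the stale build directory may make the build pick up gcc 9.4. The paths below follow the `which gcc` output above:

```shell
# Point the build at the locally installed gcc 9.4 (path assumed from the
# `which gcc` output above) instead of the system /usr/bin compilers.
export CC="$HOME/gcc-9.4/bin/gcc"
export CXX="$HOME/gcc-9.4/bin/g++"

# CMake caches the compiler choice in the build directory, so a stale cache
# keeps using the old compiler; remove it before reconfiguring.
rm -rf sentencepiece-master/build
mkdir -p sentencepiece-master/build
# then re-run: cd sentencepiece-master/build && cmake .. && make
```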

dcmouth commented 2 years ago

> Hi @heffernankevin log.txt , Can I trouble you to see where the problem is

> Hi @dcmouth, thanks for providing the log file. It looks like you might need to update your compiler to gcc 7 as it's having trouble finding string_view. Can you try upgrading and then re-running the script?
>
> Also I noticed you previously had an issue downloading moses scripts (https://github.com/facebookresearch/LASER/issues/171). Has this been resolved?

thanks a lot for your attention. the issue above (#171) has been resolved. I upgraded to gcc 7 and g++ 7, and this issue got resolved

heffernankevin commented 2 years ago

> Hi @heffernankevin log.txt , Can I trouble you to see where the problem is

> Hi @dcmouth, thanks for providing the log file. It looks like you might need to update your compiler to gcc 7 as it's having trouble finding string_view. Can you try upgrading and then re-running the script? Also I noticed you previously had an issue downloading moses scripts (#171). Has this been resolved?

> thanks a lot for your attention. the issue above (#171) has been resolved. I upgraded to gcc 7 and g++ 7, and this issue got resolved

great!!

heffernankevin commented 2 years ago
> Target "sentencepiece" requires the language dialect "CXX17"

Hi @guangyuli-uoe, from what I can see, it still looks like the machine you're running the second build attempt as mentioned here is using an outdated version of the compilers. To resolve this issue, I think the best solution might be to contact whoever manages the "gpu" machine which is using the outdated compilers, as they would have all the best information to help solve this issue!

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

when i conduct bitext_mine, e.g.

zh-doc: 3 sentences; en-doc: 2 sentences

it seems that the provided script calculates 2*3 similarities, finds the max score, and outputs the related alignment (with the max score) to the tsv file,

but how could i get all the results instead of only the max one ?

best wishes,

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, the bitext mining script will output all alignments which score above a certain threshold (default is 0), so not all sentences may be aligned. Can you try this with a larger number of en-zh sentences? (it should output many possible alignments).

guangyuli-uoe commented 2 years ago

hi @heffernankevin

i am sorry, i think i did not describe my question clearly,

e.g.

zh: 3 sentences (a, b, c); en: 2 sentences (1, 2)

what the script does now: calculates score(a,1), score(a,2), score(b,1), score(b,2), score(c,1), score(c,2) and gets the alignment with the max score for each target sentence (1 or 2); the tsv output may look like this:

0.7  a  1
0.6  c  2

but here i cannot know what other scores look like,

TT

apologies if my understanding is wrong

guangyuli-uoe commented 2 years ago

hi @heffernankevin

i am currently trying to modify the script, and it seems i could get more results

but could you please explain these two scores: fwd_scores, bwd_scores

if zh has 3 sents and en has 2 sents,

then i found that the shapes are fwd_scores: (3, 2) and bwd_scores: (2, 3)

and then the script: scores = np.concatenate((fwd_scores.max(axis=1), bwd_scores.max(axis=1)))

here i do not know why it concatenates

best wishes,
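A rough illustration of why the two directions are concatenated (a simplified sketch, not the actual mine_bitexts.py code — the real script computes margin scores over k nearest neighbours): the forward pass rates each source sentence against the targets, the backward pass each target against the sources, and concatenating the row-wise maxima yields one candidate list covering both directions, sorted best-first:

```python
import numpy as np

def mine_candidates(fwd_scores, bwd_scores):
    # fwd_scores has shape (n_src, n_tgt): source i scored against target j.
    # bwd_scores has shape (n_tgt, n_src): target j scored against source i.
    fwd_best = fwd_scores.argmax(axis=1)  # best target for each source
    bwd_best = bwd_scores.argmax(axis=1)  # best source for each target
    # One (src_index, tgt_index, score) candidate per row of each direction.
    candidates = np.concatenate([
        np.stack([np.arange(len(fwd_best)), fwd_best,
                  fwd_scores.max(axis=1)], axis=1),
        np.stack([bwd_best, np.arange(len(bwd_best)),
                  bwd_scores.max(axis=1)], axis=1),
    ])
    # Sort by score, highest first, so the strongest pairs are mined first.
    return candidates[candidates[:, 2].argsort()[::-1]]
```

With zh = 3 sentences and en = 2, this produces 3 + 2 = 5 candidates, matching the concatenation of the (3, 2) and (2, 3) score matrices' row maxima.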

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, if you would like scores between particular pairs like in your example above, instead of modifying the script perhaps one option here could be to try preprocessing your files to match pairs you would like to explicitly score. For example if you had the following two files:

zh en
A 1
B 2
C

You could then try creating the following files:

zh en
A 1
A 2
B 1
B 2
C 1
C 2

Then embed the new zh and en files, and when running mine_bitexts.py, use the option: --mode score. This will then give you all the scores between each pair. I hope this helps!
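The preprocessing step described above can be sketched as follows (a minimal illustration; `expand_pairs` is a hypothetical helper, not part of LASER):

```python
from itertools import product

def expand_pairs(src_lines, tgt_lines):
    # Pair every source sentence with every target sentence (the cross
    # product), giving two parallel lists ready to write out as the new
    # zh / en files for `mine_bitexts.py --mode score`.
    pairs = list(product(src_lines, tgt_lines))
    return [s for s, _ in pairs], [t for _, t in pairs]

# Example with the zh/en files from the comment above:
zh = ["A", "B", "C"]
en = ["1", "2"]
zh_out, en_out = expand_pairs(zh, en)
# zh_out: ["A", "A", "B", "B", "C", "C"]
# en_out: ["1", "2", "1", "2", "1", "2"]
```

Writing `zh_out` and `en_out` line by line to two files, then embedding them and scoring with `--mode score`, yields one score per pair.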

guangyuli-uoe commented 2 years ago

@heffernankevin

haha good idea ! ^^

i will try both of them !

thanks a lot !

heffernankevin commented 2 years ago

Closing due to inactivity but please reopen if needed.