danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Python version #67

Open danpovey opened 8 years ago

danpovey commented 8 years ago

It looks like pocolm does not work with python versions earlier than 2.7, because subprocess.check_output does not exist in those versions.

Could someone please update the check_dependencies.sh script to check this?

Also, the check_dependencies.sh script currently dies if 'python' resolves to python3. I don't know if this is necessary-- I think all the code should be python3 compatible. But this needs testing somehow.
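A minimal sketch of the kind of check being asked for, written in Python rather than as the actual check_dependencies.sh change. The function name `python_is_supported` is hypothetical. It only assumes what is stated above: that 2.7 is the minimum version, and that `subprocess.check_output` is what is missing on older interpreters.

```python
import subprocess
import sys


def python_is_supported(version_info=None):
    """Return True if the interpreter looks new enough for the scripts.

    Hypothetical helper: requires version >= 2.7 and the presence of
    subprocess.check_output, which was added in Python 2.7.
    """
    if version_info is None:
        version_info = sys.version_info
    new_enough = tuple(version_info[:2]) >= (2, 7)
    return new_enough and hasattr(subprocess, "check_output")


if __name__ == "__main__":
    if not python_is_supported():
        sys.exit("python >= 2.7 is required (found %d.%d)" % sys.version_info[:2])
    print("python version OK: %d.%d" % sys.version_info[:2])
```

Comparing version tuples keeps the check valid for python 3.x as well, so the same test does not need to rule out python3.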

chris920820 commented 8 years ago

Maybe I can try to figure out those issues.

best regards, Zhouyang

danpovey commented 8 years ago

thanks

chris920820 commented 8 years ago

Hello Dan, I have modified some parts of the scripts so that they check whether python is correctly set to 2.7 (besides 'subprocess', the module 'argparse' cannot be imported in python 2.6 either). I also found the main reason why python 3.x doesn't work: range(n) and dic.keys() are not lists there, so indexing them directly or concatenating them with a list is not allowed. After casting them to list, it works fine. I attached the detailed changes I made and the log file, but I don't think it is necessary for you to read them.

best regards, Zhouyang

(Deleted "or create an bash alias for pocolm/kaldi scripts to run correctly", because the scripts start with "#!/usr/bin/env python": even if I set an alias like "alias python='python2.7'", the shebang still calls the default python. If the default python is 2.6, this can be an issue.)

(range(n) is not a list in python 3.x)

  File "/Users/zhouyangzhang/pocolm/scripts/validate_count_dir.py", line 144, in
    for n in [ 'dev' ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
get_objf_and_derivs.py: count-dir validation failed

( d.keys() is not list in python 3.x)

get_unigram_weights.py data/lm/work/word_counts > data/text/unigram_weights

running at Tue Sep 13 04:15:18 2016

Traceback (most recent call last):
  File "/Users/zhouyangzhang/pocolm/scripts/get_unigram_weights.py", line 66, in
    print(train_keys[0], 1.0)
TypeError: 'dict_keys' object does not support indexing

exited with return code 0 after 0.1 seconds
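The two failures above can be reproduced and avoided with a short sketch. The function names `count_set_names` and `first_key` are hypothetical and not taken from the pocolm scripts; the point is only that wrapping `range(...)` and `d.keys()` in `list(...)` behaves the same on python 2.7 and python 3.x.

```python
def count_set_names(num_train_sets):
    """Names of the count sets: 'dev' followed by 1..num_train_sets.

    list(...) is needed on Python 3, where range() returns a lazy range
    object that cannot be concatenated to a list.
    """
    return ["dev"] + list(range(1, num_train_sets + 1))


def first_key(d):
    """Return one key of a dict in a way that works on Python 2 and 3.

    On Python 3, d.keys() is a view that does not support indexing, so it
    is converted to a list first.
    """
    keys = list(d.keys())
    return keys[0]


print(count_set_names(3))       # ['dev', 1, 2, 3]
print(first_key({"train": 1}))  # train
```

On python 2.7 the extra `list(...)` call just copies an existing list, so the cost of staying compatible is small for loops of this size.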

( range(n) is not a list in python 3.x)

get_counts.py --min-counts='' --max-memory=10G --limit-unk-history=false data/lm/work/int_20000 3 data/lm/work/counts_20000_3

running at Tue Sep 13 04:23:07 2016

validate_vocab.py: validated file data/lm/work/int_20000/words.txt with 20000 entries.
get_counts.py: dumping counts
Traceback (most recent call last):
  File "/Users/zhouyangzhang/pocolm/scripts/get_counts.py", line 491, in
    for n in [ "dev" ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list

exited with return code 0 after 0.3 seconds

(sort requires an integer size such as 10G, not 10.0G)

bash -c 'set -o pipefail; export LC_ALL=C; gunzip -c data/lm/work/int_20000/dev.txt.gz | get-text-counts 3 | sort --buffer-size=10.0G | uniq -c | get-int-counts /dev/null data/lm/work/counts_20000_3/int.dev.2 data/lm/work/counts_20000_3/int.dev.3'

running at Tue Sep 13 04:25:36 2016

sort: invalid suffix in --buffer-size argument '10.0G'
get-int-counts: processed no data

exited with return code 0 after 0.0 seconds
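The log above suggests the memory value reached sort as a float ('10.0G'). A minimal sketch of one way to avoid that is shown here. The helper `format_sort_buffer_size` is hypothetical and is not the code in get_counts.py; it assumes the size is held as a number of gigabytes and that sort interprets the G and M suffixes as powers of 1024.

```python
def format_sort_buffer_size(gigabytes):
    """Format a memory amount for sort's --buffer-size option.

    Hypothetical helper: emits an integer followed by a unit suffix, so a
    value like 10.0 becomes '10G' rather than '10.0G'.  Fractional amounts
    are expressed in whole megabytes instead (assuming 1G == 1024M).
    """
    if gigabytes <= 0:
        raise ValueError("buffer size must be positive")
    if float(gigabytes).is_integer():
        return "%dG" % int(gigabytes)
    return "%dM" % max(1, int(gigabytes * 1024))


print(format_sort_buffer_size(10.0))  # 10G
print(format_sort_buffer_size(2.5))   # 2560M
```

A float can appear here when python 3's true division is applied to a value that python 2 would have kept as an integer, which may be why the problem only showed up under python 3; the thread does not confirm this.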

(numpy has to be installed for python3; use python3.4 -m pip install numpy)

(Same problem: range(n) is not a list)

validate_count_dir.py: validated counts directory data/lm/work/counts_20000_3_subset10
Traceback (most recent call last):
  File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 48, in
    CleanupDir(args.count_dir, ngram_order, num_train_sets)
  File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 25, in CleanupDir
    for n in [ 'dev' ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
train_lm.py: failed to cleanup subset count dir: data/lm/work/counts_20000_3_subset10

danpovey commented 8 years ago

> Hello, Dan I have modify some parts of the scripts so it can check if python have been correctly set to 2.7 (besides 'subprocess', the module 'argparse' cannot be properly imported in python 2.6 as well).

OK, if there is no argparse in python 2.6, then we should require python 2.7. You can check this in check_dependencies.sh; no need to check it elsewhere.

> I also find the mean reason why python 3.x doesn't work: range(n) and dic.keys() is not list, so directly indexing or concatenating with list is not allowed. After cast them to list, it works fine. I attached the detailed change I made and log file, but I don't think it is necessary for you to read.

In your pull request, make the script changes so that python 3.x will work, and then there is no need to rule out python 3.

Dan

chris920820 commented 8 years ago

The changes in the PR now make python 3.x work. If the version found is python 3.x, check_dependencies.sh prints a message like:

echo "$0: pocolm is compatible with python 3.x, but for best reliability"
echo "$0: python 2.7 is recommended"

Is this appropriate? And do I need to follow this suggestion: "Its probably better to import https://pypi.python.org/pypi/six library which helps to provide python 2/3 support in consistent way. You could use xrange consistently then."

danpovey commented 8 years ago

I don't think we need to warn about python3, let's just make sure it works reliably. I don't think the package "six" is necessary, it might itself cause compatibility problems, and I don't think we are using any of the features that "six" provides support for.

Dan

chris920820 commented 7 years ago

Hello Dan, I have made more changes and tested on swbd and swbd_fisher, both locally and on the cluster. It works fine now.

First, inside "except Exception as e:", print(str(e)) prints nothing, but print(repr(e)) prints the right error message.
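One situation in which `str(e)` is empty while `repr(e)` is still informative is an exception raised without a message. Whether that was the case in the PR is an assumption; the sketch below only illustrates the difference, and the helper name `describe` is hypothetical.

```python
def describe(exc):
    """Return (str(exc), repr(exc)) for an exception instance."""
    return str(exc), repr(exc)


try:
    raise ValueError()  # raised with no message
except Exception as e:
    as_str, as_repr = describe(e)
    print("str : %r" % as_str)   # str : ''
    print("repr: %r" % as_repr)  # repr: 'ValueError()'
```

Because `repr(e)` always includes the exception class name, it is a reasonable fallback for log messages when the message text may be missing.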

Second, in python3 the printed ngram information comes out in a weird form:

1 b'19999' 2 b'103183' 3 b'54472'

whereas the correct output should look like:

1 19999 2 103509 3 54192
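The `b'...'` form is what python 3 prints for a bytes object. The thread does not say which program produces these counts, so the following is only a sketch under the assumption that the numbers come from `subprocess.check_output`, which returns bytes on python 3. The helper `run_and_read` and the stand-in command are hypothetical.

```python
import subprocess
import sys


def run_and_read(command):
    """Run a command and return its stdout as text on Python 2 and 3.

    subprocess.check_output returns bytes on Python 3, so the result is
    decoded before further use.
    """
    output = subprocess.check_output(command)
    return output.decode("utf-8").strip()


# Stand-in for a program that prints a count; not a pocolm command.
command = [sys.executable, "-c", "print(19999)"]
raw = subprocess.check_output(command).strip()
print(1, raw)                    # on Python 3: 1 b'19999'
print(1, run_and_read(command))  # on Python 3: 1 19999
```

Decoding once, at the point where the output is read, keeps the rest of the script working with ordinary strings on both python versions.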

best regards, Zhouyang

danpovey commented 7 years ago

> Hello, Dan
>
> I have made more changes, and tested on swbd and swbd_fisher locally and in cluster. It works fine now.
>
> In particular, except Exception as e: if we do "print(str(e))", it print nothing but print(repr(e)) print the right error message.
>
> Second, in python3, the printed ngram information is in weird form. 1 b'19999' 2 b'103183' 3 b'54472'
>
> and the correct one should be, 1 19999 2 103509 3 54192

Can you figure out a fix for this issue before we commit this pull request? Which program prints that out?

Dan

best regards, Zhouyang

On Sep 13, 2016, at 3:54 PM, Daniel Povey notifications@github.com wrote:

I don't think we need to warn about python3, let's just make sure it works reliably. I don't think the package "six" is necessary, it might itself cause compatibility problems, and I don't think we are using any of the features that "six" provides support for.

Dan

On Tue, Sep 13, 2016 at 12:36 PM, chris920820 notifications@github.com wrote:

The change in the PR now makes python 3.x work, and if we find that the version is python 3.x, check_dependencies.sh will print a message like:

echo "$0: pocolm is compatible with python 3.x, but for best reliability"
echo "$0: python 2.7 is recommended"

Is this appropriate? And do I need to follow the suggestion, "Its probably better to import https://pypi.python.org/pypi/six library which helps to provide python 2/3 support in consistent way. You could use xrange consistently then."

On Sep 13, 2016, at 3:29 PM, Daniel Povey notifications@github.com wrote:

Hello, Dan

I have modified some parts of the scripts so they can check whether python has been correctly set to 2.7 (besides 'subprocess', the module 'argparse' cannot be properly imported in python 2.6 either).

OK, if there is no argparse in python 2.6, then we should require python 2.7. You can check this in check_dependencies.sh; no need to check it elsewhere.
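How check_dependencies.sh performs this check is not shown in the thread. As a rough sketch, the test itself could be expressed in Python along these lines and invoked from the shell script; `python_is_supported` is a hypothetical name.

```python
import sys


def python_is_supported(version_info=None):
    """Return True for Python 2.7 or newer (including 3.x).

    Python 2.6 lacks both subprocess.check_output and argparse in its
    standard library, which is why 2.7 is taken as the minimum here.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version_info[:2]) >= (2, 7)


if not python_is_supported():
    sys.exit("python 2.7 or newer is required")
```

Comparing the version tuple avoids string comparison pitfalls such as "2.10" sorting before "2.7".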

I also found the main reason why python 3.x doesn't work: range(n) and dic.keys() are not lists, so directly indexing them or concatenating them with a list is not allowed. After casting them to list, it works fine. I attached the detailed changes I made and the log file, but I don't think it is necessary for you to read them.

In your pull request, make the script changes so that python 3.x will work, and then there is no need to rule out python 3. Dan

best regards, Zhouyang

(delete "or create an bash alias for pocolm/kaldi scripts to run correctly", because the scripts start with "#!/usr/bin/env python": even if I set an alias like "alias python='python2.7'", the shebang will still call the default python. If the default python is 2.6, this can be an issue)

( range(n) is not a list in python 3.x) File "/Users/zhouyangzhang/pocolm/scripts/validate_count_dir.py", line 144, in for n in [ 'dev' ] + range(1, num_train_sets + 1): TypeError: can only concatenate list (not "range") to list get_objf_and_derivs.py: count-dir validation failed
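The traceback above is the Python 3 behaviour where `range()` returns a lazy range object instead of a list. Here is a minimal sketch of the cast that works in both 2.7 and 3.x; the function name `count_set_names` is made up for illustration, only the loop expression comes from the log.

```python
def count_set_names(num_train_sets):
    """Return ['dev', 1, 2, ..., num_train_sets].

    list(range(...)) yields a real list in Python 3 and is a harmless
    copy in Python 2, so the concatenation works in both.
    """
    return ['dev'] + list(range(1, num_train_sets + 1))


for n in count_set_names(3):
    print(n)
```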

( d.keys() is not list in python 3.x)

get_unigram_weights.py data/lm/work/word_counts > data/text/unigram_weights

running at Tue Sep 13 04:15:18 2016

Traceback (most recent call last): File "/Users/zhouyangzhang/pocolm/scripts/get_unigram_weights.py", line 66, in print(train_keys[0], 1.0) TypeError: 'dict_keys' object does not support indexing

exited with return code 0 after 0.1 seconds
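The `'dict_keys' object does not support indexing` error above has the same cause: in Python 3, `d.keys()` returns a view rather than a list. The following is a small sketch of the cast, with a made-up dictionary standing in for whatever get_unigram_weights.py actually reads.

```python
# Hypothetical stand-in for the per-training-set counts.
train_counts = {'1': 1000}

# list(...) gives an indexable list in both Python 2.7 and 3.x.
train_keys = list(train_counts.keys())
print(train_keys[0], 1.0)

# When only one (arbitrary) key is needed, this avoids building the list.
first_key = next(iter(train_counts))
```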

( range(n) is not a list in python 3.x)

get_counts.py --min-counts='' --max-memory=10G --limit-unk-history=false data/lm/work/int_20000 3 data/lm/work/counts_20000_3

running at Tue Sep 13 04:23:07 2016

validate_vocab.py: validated file data/lm/work/int_20000/words.txt with 20000 entries. get_counts.py: dumping counts Traceback (most recent call last): File "/Users/zhouyangzhang/pocolm/scripts/get_counts.py", line 491, in for n in [ "dev" ] + range(1, num_train_sets + 1): TypeError: can only concatenate list (not "range") to list # exited with return code 0 after 0.3 seconds

(required integer like 10G not 10.0G) # bash -c 'set -o pipefail; export LC_ALL=C; gunzip -c data/lm/work/int_20000/dev.txt.gz | get-text-counts 3 | sort --buffer-size=10.0G | uniq -c | get-int-counts /dev/null data/lm/work/counts_20000_3/int.dev.2 data/lm/work/counts_20000_3/int.dev.3' # running at Tue Sep 13 04:25:36 2016 sort: invalid suffix in --buffer-size argument '10.0G' get-int-counts: processed no data # exited with return code 0 after 0.0 seconds

(install numpy on python3, use python3.4 -m pip install numpy)

(Same problem: range(n) is not a list) validate_count_dir.py: validated counts directory data/lm/work/counts_20000_3_subset10 Traceback (most recent call last): File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 48, in CleanupDir(args.count_dir, ngram_order, num_train_sets) File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 25, in CleanupDir for n in [ 'dev' ] + range(1, num_train_sets + 1): TypeError: can only concatenate list (not "range") to list train_lm.py: failed to cleanup subset count dir: data/lm/work/counts_20000_3_subset10


chris920820 commented 7 years ago

I already fixed this in a PR. It is in the script "prune_lm_dir.py", in the function writenumngrams. The reason, I believe, is that the values have the type bytes (e.g. b"1234"), so we need to explicitly convert them to "int".

best regards, Zhouyang
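The conversion described above can be sketched as follows. The real code in prune_lm_dir.py is not shown in the thread, so `format_num_ngrams` and its input are hypothetical; the point is only that `int()` accepts ASCII digits given as bytes.

```python
def format_num_ngrams(counts):
    """Format per-order n-gram counts as 'order count' lines.

    counts may hold str or bytes digits (e.g. b'19999'); int() parses
    both, so the b'...' wrapper never reaches the output.
    """
    lines = []
    for order, count in enumerate(counts, start=1):
        lines.append("%d %d" % (order, int(count)))
    return lines


# Values shaped like those quoted in the thread.
raw_counts = [b'19999', b'103509', b'54192']
fixed = format_num_ngrams(raw_counts)
for line in fixed:
    print(line)
```

This treats the symptom at the point of printing; it does not address where the bytes values come from.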


danpovey commented 7 years ago

I think the real issue is a bit earlier on. python3 seems to be happiest dealing with the 'str' type, which is a sequence of unicode characters. But we seem to have got hold of a bytes-array type b'142315' somewhere. Looks like some file may have been opened in binary mode (the distinction doesn't seem to really matter in python2). It could be that we need to be passing universal_newlines=True to all the Popen calls, at least any of them that deal with text data (which I think is all of them). If not, this issue may crop up elsewhere.

Dan
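Here is a small sketch of the difference `universal_newlines=True` makes to a Popen pipe in Python 3. The child command is a stand-in (the current interpreter printing a number), not one of the pocolm binaries, and `run_and_read` is a made-up helper.

```python
import subprocess
import sys


def run_and_read(cmd, text_mode):
    """Run cmd and return everything it wrote to stdout."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            universal_newlines=text_mode)
    out, _ = proc.communicate()
    return out


# Stand-in child process that prints a count, as a counting tool might.
cmd = [sys.executable, '-c', 'print(19999)']

raw = run_and_read(cmd, text_mode=False)   # bytes in Python 3
text = run_and_read(cmd, text_mode=True)   # str in Python 2 and 3

print(type(raw).__name__, type(text).__name__)
```

With the flag set, the pipe is decoded using the locale's preferred encoding, so downstream code sees ordinary `str` values on both Python versions.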


chris920820 commented 7 years ago

I understand now. Maybe I can try to trace back to see if I can find where it actually comes from. Thanks

best regards, Zhouyang


danpovey commented 7 years ago

Also, text_to_int.py will probably be much less efficient in python3, as it will convert all the string arrays into things of type 'bytes' which occupy probably 4x more memory. But I can't see an easy fix that would be compatible in python 2 and 3.

I notice that if you use str(x) to convert a byte sequence into a string, you get b'x':

a = b'431423'
str(a)
"b'431423'"

It seems you have to do str(a, 'utf-8') ... which is odd. This could crop up a lot in the codebase. But I'd like to make it all python3 compatible.

Dan
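To make the observation above concrete, here is a minimal sketch (not pocolm code) of what happens to a byte string under str() and under .decode(). The value is the one from the comment; the variable names are only for illustration.

```python
# Sketch only: compare str() and .decode() on a byte string.
a = b'431423'

# On Python 3, str(a) returns the repr-like text "b'431423'".
# On Python 2, bytes and str are the same type, so it returns '431423'.
as_str = str(a)

# .decode() is available on byte strings in both Python 2 and Python 3
# and returns text without the b'...' wrapper.
as_text = a.decode('utf-8')

print(repr(as_str))
print(repr(as_text))
print(int(as_text) + 1)
```

The last line shows that the decoded value converts to an integer cleanly, which is what the ngram-count code needs.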

On Tue, Sep 13, 2016 at 4:50 PM, chris920820 notifications@github.com wrote:

I understand now. Maybe I can try to trace back to see if I can find where it actually comes from. Thanks

best regards, Zhouyang

On Sep 13, 2016, at 7:46 PM, Daniel Povey notifications@github.com wrote:

I think the real issue is a bit earlier on. python3 seems to be happiest dealing with the 'str' type, which is a sequence of unicode characters. But we seem to have got hold of a bytes-array type b'142315' somewhere. Looks like some file may have been opened in binary mode (the distinction doesn't seem to really matter in python2). It could be that we need to be passing universal_newlines=True to all the Popen calls, at least any of them that deal with text data (which I think is all of them). If not, this issue may crop up elsewhere.

Dan
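A small sketch of the universal_newlines suggestion, assuming a stand-in child command (the real pocolm scripts call other programs). It compares the type returned by subprocess.check_output with and without the flag.

```python
from __future__ import print_function

import subprocess
import sys

# Hypothetical child command standing in for a pocolm pipeline stage.
cmd = [sys.executable, '-c', 'print(19999)']

# Without universal_newlines, Python 3 returns a bytes object.
raw = subprocess.check_output(cmd)

# With universal_newlines=True, Python 3 decodes the output to str.
# Python 2 returns its native str type in both cases.
text = subprocess.check_output(cmd, universal_newlines=True)

print(type(raw).__name__, type(text).__name__)
print(int(text.strip()))
```

The same keyword is accepted by subprocess.Popen, so pipes opened that way can be read as text too.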

On Tue, Sep 13, 2016 at 4:30 PM, chris920820 notifications@github.com wrote:

I already fixed them in a PR. It is in the script "prune_lm_dir.py", function writenumngrams. The reason, I believe, is that there is a type called bytes, e.g. b"1234", and we need to explicitly convert such values to "int".

best regards, Zhouyang

On Sep 13, 2016, at 7:21 PM, Daniel Povey notifications@github.com wrote:

Hello, Dan

I have made more changes, and tested on swbd and swbd_fisher locally and in cluster. It works fine now.

In particular, with except Exception as e: if we do "print(str(e))", it prints nothing, but print(repr(e)) prints the right error message.
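One case where this happens is an exception raised without a message. The sketch below is an illustration of that case only; which exceptions the pocolm scripts actually hit is not stated in the thread.

```python
from __future__ import print_function

try:
    # An exception raised without a message carries no arguments.
    raise AssertionError()
except Exception as e:
    as_str = str(e)    # empty string, so printing it shows nothing
    as_repr = repr(e)  # still names the exception type

print('str(e)  ->', repr(as_str))
print('repr(e) ->', as_repr)
```

Printing repr(e) therefore always leaves at least the exception class name in the log.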

Second, in python3, the printed ngram information is in a weird form:

1 b'19999'
2 b'103183'
3 b'54472'

and the correct one should be:

1 19999
2 103509
3 54192

Can you figure out a fix for this issue before we commit this pull request? Which program prints that out?

Dan

best regards, Zhouyang

On Sep 13, 2016, at 3:54 PM, Daniel Povey < notifications@github.com> wrote:

I don't think we need to warn about python3, let's just make sure it works reliably. I don't think the package "six" is necessary, it might itself cause compatibility problems, and I don't think we are using any of the features that "six" provides support for.

Dan

On Tue, Sep 13, 2016 at 12:36 PM, chris920820 < notifications@github.com> wrote:

The change in the PR now makes python 3.x work, and if we find that the version is python 3.x, check_dependencies.sh will print a message like:

echo "$0: pocolm is compatible with python 3.x, but for best reliability"
echo "$0: python 2.7 is recommended"

Is this appropriate? And do I need to follow the suggestion, "It's probably better to import the https://pypi.python.org/pypi/six library, which helps to provide python 2/3 support in a consistent way. You could use xrange consistently then."

On Sep 13, 2016, at 3:29 PM, Daniel Povey < notifications@github.com> wrote:

Hello, Dan. I have modified some parts of the scripts so that they can check whether python has been correctly set to 2.7 (besides 'subprocess', the module 'argparse' cannot be properly imported in python 2.6 either).

OK, if there is no argparse in python 2.6, then we should require python 2.7. You can check this in check_dependencies.sh; no need to check it elsewhere.
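check_dependencies.sh is a shell script, so the real check would live there; purely as an illustration, the version test itself could be expressed in Python as below. The function name python_is_supported is hypothetical.

```python
import sys


def python_is_supported(version_info=sys.version_info):
    # Compare only (major, minor); tuple comparison also accepts 3.x.
    return tuple(version_info[:2]) >= (2, 7)


if not python_is_supported():
    sys.exit('python 2.7 or newer is required: argparse and '
             'subprocess.check_output are not available in python 2.6')
```

A shell script could run such a snippet with the 'python' found on the PATH and test its exit status.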

I also found the main reason why python 3.x doesn't work: range(n) and dic.keys() are not lists, so directly indexing them or concatenating them with a list is not allowed. After casting them to list, it works fine. I attached the detailed changes I made and the log file, but I don't think it is necessary for you to read them.
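The cast described above can be sketched as follows. The values of num_train_sets and train_counts are made up for illustration.

```python
num_train_sets = 3  # hypothetical value

# list(...) copies the list on Python 2 and converts the range object on
# Python 3, so the concatenation works on both.
names = ['dev'] + list(range(1, num_train_sets + 1))

train_counts = {'a': 10, 'b': 5}  # hypothetical stand-in for word counts

# dict.keys() returns a view on Python 3; list(...) makes it indexable.
train_keys = list(train_counts.keys())

print(names)
print(train_keys[0] in train_counts)
```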

In your pull request, make the script changes so that python 3.x will work, and then there is no need to rule out python 3. Dan

best regards, Zhouyang

(delete "or create an bash alias for pocolm/kaldi scripts to run correctly", because the scripts are invoked via "#!/usr/bin/env python": even if I set an alias like "alias python='python2.7'", the shebang will still call the default python. If the default python is 2.6, this can be an issue)

(range(n) is not a list in python 3.x)

File "/Users/zhouyangzhang/pocolm/scripts/validate_count_dir.py", line 144, in <module>
for n in [ 'dev' ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
get_objf_and_derivs.py: count-dir validation failed

(d.keys() is not a list in python 3.x)

get_unigram_weights.py data/lm/work/word_counts > data/text/unigram_weights
running at Tue Sep 13 04:15:18 2016
Traceback (most recent call last):
File "/Users/zhouyangzhang/pocolm/scripts/get_unigram_weights.py", line 66, in <module>
print(train_keys[0], 1.0)
TypeError: 'dict_keys' object does not support indexing
exited with return code 0 after 0.1 seconds

(range(n) is not a list in python 3.x)

get_counts.py --min-counts='' --max-memory=10G --limit-unk-history=false data/lm/work/int_20000 3 data/lm/work/counts_20000_3
running at Tue Sep 13 04:23:07 2016
validate_vocab.py: validated file data/lm/work/int_20000/words.txt with 20000 entries.
get_counts.py: dumping counts
Traceback (most recent call last):
File "/Users/zhouyangzhang/pocolm/scripts/get_counts.py", line 491, in <module>
for n in [ "dev" ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
# exited with return code 0 after 0.3 seconds

(required integer like 10G not 10.0G)

# bash -c 'set -o pipefail; export LC_ALL=C; gunzip -c data/lm/work/int_20000/dev.txt.gz | get-text-counts 3 | sort --buffer-size=10.0G | uniq -c | get-int-counts /dev/null data/lm/work/counts_20000_3/int.dev.2 data/lm/work/counts_20000_3/int.dev.3'
# running at Tue Sep 13 04:25:36 2016
sort: invalid suffix in --buffer-size argument '10.0G'
get-int-counts: processed no data
# exited with return code 0 after 0.0 seconds

(install numpy on python3, use python3.4 -m pip install numpy)

(Same problem: range(n) is not a list)

validate_count_dir.py: validated counts directory data/lm/work/counts_20000_3_subset10
Traceback (most recent call last):
File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 48, in <module>
CleanupDir(args.count_dir, ngram_order, num_train_sets)
File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 25, in CleanupDir
for n in [ 'dev' ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
train_lm.py: failed to cleanup subset count dir: data/lm/work/counts_20000_3_subset10


danpovey commented 7 years ago

... and str(a, 'utf-8') doesn't work in python2. Maybe it's best to just avoid str() conversions where it's not necessary (i.e. when the input is some type of string already).
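One way to follow this advice while still coping with byte strings is a small helper that leaves native strings alone and only decodes real bytes objects. This is a sketch under the assumption that such a helper would be acceptable; the name to_native_str is hypothetical and not part of pocolm.

```python
def to_native_str(value, encoding='utf-8'):
    # Already the native string type: return it untouched, no str() call.
    if isinstance(value, str):
        return value
    # Only reachable on Python 3, where bytes and str are different types
    # (on Python 2, bytes is str and was handled above).
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Anything else (e.g. an int) falls back to the normal conversion.
    return str(value)


print(to_native_str(b'431423'))
print(to_native_str('431423'))
```

Because it never calls str(a, 'utf-8'), the helper avoids the two-argument form that fails on Python 2.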


chris920820 commented 7 years ago

Hello, Dan

Sorry for doing this late; I was spending time on some homework. I have updated the pull request and tested it on python2 and python3.

i) After I add "universal_newlines = True" to all subprocess.Popen and subprocess.check_output calls, python reads their output as "str", not bytes.

ii) For issues like for n in [ 'dev' ] + range(1, num_train_sets + 1): we can add a "list" around range, but I think this can be a bit inefficient because in python2 it actually makes a copy of the original list (i.e. range(1, num_train_sets+1)). This time I use for n in itertools.chain([ 'dev' ], range(1, num_train_sets + 1)): This works for python2 and python3 without copying unnecessary things. So, I think this might be a reasonable solution.

iii) For things like train_keys = list(train_counts.keys()), instead of leaving it as a "set"-like object, we had better convert it to a list, because afterwards we use an index to iterate over it.

iv) Other issues have already been addressed: str(exception) sometimes cannot be printed properly, but repr(exception) works. Max-memory was a float like 1000.3 plus a unit, which raises an error; we now convert it to an integer like 1000 plus a unit.
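The two alternatives in point ii) can be compared side by side. This is only a sketch with a made-up num_train_sets; it shows that both forms visit the same items.

```python
import itertools

num_train_sets = 3  # hypothetical value

# Lazy version: no intermediate list is built, but the iterator can only
# be consumed once.
chained = itertools.chain(['dev'], range(1, num_train_sets + 1))
visited = [n for n in chained]

# Plain concatenation: builds a new list, which can be reused and indexed.
concatenated = ['dev'] + list(range(1, num_train_sets + 1))

print(visited == concatenated)
```

For a loop over a handful of training sets the cost of the extra list is negligible, so the choice is mostly about readability.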

best regards, Zhouyang


danpovey commented 7 years ago

Sorry for the delay; I was busy with homework. I have updated the pull request and tested it on python2 and python3.

i) After adding "universal_newlines=True" to all subprocess.Popen and subprocess.check_output calls, python3 reads their output as "str" rather than bytes.

ii) For code like for n in [ 'dev' ] + range(1, num_train_sets + 1): we could wrap the range in list(), but I think that is a bit inefficient, because in python2 it copies the original list (i.e. range(1, num_train_sets + 1)). This time I used for n in itertools.chain([ 'dev' ], range(1, num_train_sets + 1)): which works in python2 and python3 without the unnecessary copy. So I think this is a reasonable solution.

iii) For things like train_keys = list(train_counts.keys()), rather than leaving it as a set-like object, it is better to convert it to a list, because we later index into it.

I'd rather use '+' because it's easier to understand. There is no point optimizing outer loops.

iv) Other issues have already been addressed: str(exception) sometimes does not print properly, but repr(exception) works. Max-memory could be a float like 1000.3 followed by the unit, which raises an error, so we now convert it to an integer like 1000 followed by the unit.

ok thanks.
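For reference, the '+' form preferred above and the list(keys()) conversion from item iii can be sketched as follows. This is only an illustration: num_train_sets and train_counts here are made-up values, not taken from the pocolm scripts.

```python
num_train_sets = 3  # illustrative value, not from pocolm

# range() returns a list in python2 but a lazy range object in python3,
# so wrap it in list() before concatenating with '+'.
names = ['dev'] + list(range(1, num_train_sets + 1))

# dict.keys() returns a view in python3, which cannot be indexed;
# converting to a list works in both python2 and python3.
train_counts = {'a': 10, 'b': 20}  # hypothetical data
train_keys = list(train_counts.keys())
first_key = train_keys[0]
```

The extra copy made by list() only matters for very large ranges, which matches the point that outer loops are not worth optimizing.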

... and str(a, 'utf-8') doesn't work in python2. Maybe it's best to just avoid str() conversions where it's not necessary (i.e. when the input is some type of string already).

On Tue, Sep 13, 2016 at 5:02 PM, Daniel Povey dpovey@gmail.com wrote:

Also, text_to_int.py will probably be much less efficient in python3, as it will convert all the string arrays into things of type 'bytes' which occupy probably 4x more memory. But I can't see an easy fix that would be compatible in python 2 and 3.

I notice that if you use str(x) to convert a byte sequence into a string, you get b'x':

a = b'431423'
str(a)
"b'431423'"

It seems you have to do str(a, 'utf-8') ... which is odd. This could crop up a lot in the codebase. But I'd like to make it all python3 compatible.
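One way to sidestep the str(a, 'utf-8') incompatibility is to decode only when the value really is a bytes object. The helper below is a sketch; to_text is a hypothetical name and is not part of pocolm.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only if it is a python3 bytes object.

    In python2, bytes is an alias for str, so the second isinstance check
    makes the function return python2 strings unchanged.
    """
    if isinstance(value, bytes) and not isinstance(value, str):
        return value.decode(encoding)
    return value


count = int(to_text(b'431423'))
```

This leaves values that are already strings untouched, in line with the suggestion to avoid str() conversions where they are not needed.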

Dan

On Tue, Sep 13, 2016 at 4:50 PM, chris920820 <notifications@github.com> wrote:

I understand now. I can try to trace back and see where it actually comes from. Thanks

best regards, Zhouyang

On Sep 13, 2016, at 7:46 PM, Daniel Povey <notifications@github.com> wrote:

I think the real issue is a bit earlier on. python3 seems to be happiest dealing with the 'str' type, which is a sequence of unicode characters. But we seem to have got hold of a bytes-array type b'142315' somewhere. Looks like some file may have been opened in binary mode (the distinction doesn't seem to really matter in python2). It could be that we need to be passing universal_newlines=True to all the Popen calls, at least any of them that deal with text data (which I think is all of them). If not, this issue may crop up elsewhere.

Dan
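The effect of universal_newlines=True can be seen with a small stand-alone example. The child command below is a stand-in that just prints a number; it is not one of the pocolm programs.

```python
import subprocess
import sys

# A child process that prints a number, standing in for a pocolm tool.
cmd = [sys.executable, '-c', 'print(19999)']

# Without universal_newlines, python3 returns the output as bytes.
raw = subprocess.check_output(cmd)

# With universal_newlines=True, the output is decoded to str in python3
# (and stays an ordinary str in python2).
text = subprocess.check_output(cmd, universal_newlines=True)

num_ngrams = int(text.strip())
```

The same keyword is accepted by subprocess.Popen, so pipes opened that way also yield text rather than bytes.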

On Tue, Sep 13, 2016 at 4:30 PM, chris920820 < notifications@github.com> wrote:

I have already fixed this in a PR. It is in the script "prune_lm_dir.py", in the function writenumngrams. I believe the reason is that the values have type bytes, like b"1234", and we need to explicitly convert them to "int".

best regards, Zhouyang

On Sep 13, 2016, at 7:21 PM, Daniel Povey <notifications@github.com> wrote:

Hello, Dan

I have made more changes, and tested on swbd and swbd_fisher locally and on the cluster. It works fine now.

In particular, inside an "except Exception as e:" block, print(str(e)) prints nothing, but print(repr(e)) prints the right error message.
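One common way to get an empty str(e) is an exception raised without a message; whether that is what happened in the pocolm scripts is an assumption. A minimal sketch of the difference:

```python
try:
    # An exception raised with no message at all.
    raise ValueError()
except Exception as e:
    as_str = str(e)    # empty string: there is no message to show
    as_repr = repr(e)  # still names the exception type

print("str:  '{0}'".format(as_str))
print("repr: '{0}'".format(as_repr))
```

Because repr(e) always includes the exception class name, it stays informative even when the message is empty.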

Second, in python3, the printed ngram information comes out in a weird form: 1 b'19999' 2 b'103183' 3 b'54472'

and the correct one should be, 1 19999 2 103509 3 54192

Can you figure out a fix for this issue before we commit this pull request? Which program prints that out?

Dan

best regards, Zhouyang

On Sep 13, 2016, at 3:54 PM, Daniel Povey < notifications@github.com> wrote:

I don't think we need to warn about python3, let's just make sure it works reliably. I don't think the package "six" is necessary, it might itself cause compatibility problems, and I don't think we are using any of the features that "six" provides support for.

Dan

On Tue, Sep 13, 2016 at 12:36 PM, chris920820 < notifications@github.com> wrote:

The change in the PR now makes python 3.x work, and if check_dependencies.sh finds python 3.x, it will print a message like:

echo "$0: pocolm is compatible with python 3.x, but for best reliability"
echo "$0: python 2.7 is recommended"

Is this appropriate? And do I need to follow the suggestion, "It's probably better to import the https://pypi.python.org/pypi/six library, which helps to provide python 2/3 support in a consistent way. You could use xrange consistently then."?

On Sep 13, 2016, at 3:29 PM, Daniel Povey < notifications@github.com> wrote:

Hello, Dan

I have modified some parts of the scripts so that they check whether python is correctly set to 2.7 (besides 'subprocess', the module 'argparse' cannot be properly imported in python 2.6 either).

OK, if there is no argparse in python 2.6, then we should require python 2.7. You can check this in check_dependencies.sh; no need to check it elsewhere.
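The version test itself is simple to express in Python. The function below is a hypothetical sketch of the check that check_dependencies.sh would need to perform; how the shell script actually invokes python is not shown in this thread.

```python
import sys


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is at least python 2.7.

    Python 2.6 lacks both subprocess.check_output and argparse,
    which is why 2.7 is the minimum here.
    """
    return tuple(version_info[:2]) >= (2, 7)


if not python_is_supported():
    sys.stderr.write("python 2.7 or newer is required\n")
```

Comparing tuples avoids string comparisons of version numbers, which would wrongly order "2.10" before "2.7".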

I also found the main reason why python 3.x doesn't work: range(n) and dic.keys() are not lists, so directly indexing them or concatenating them with a list is not allowed. After casting them to lists, it works fine. I attached the detailed changes I made and the log file, but I don't think it is necessary for you to read them.

In your pull request, make the script changes so that python 3.x will work, and then there is no need to rule out python 3. Dan

best regards, Zhouyang

(Delete "or create an bash alias for pocolm/kaldi scripts to run correctly", because the scripts start with "#!/usr/bin/env python": even if I set an alias like "alias python='python2.7'", the shebang will still call the default python. If the default python is 2.6, this can be an issue.)

( range(n) is not a list in python 3.x)

  File "/Users/zhouyangzhang/pocolm/scripts/validate_count_dir.py", line 144, in <module>
    for n in [ 'dev' ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
get_objf_and_derivs.py: count-dir validation failed

( d.keys() is not list in python 3.x)

# get_unigram_weights.py data/lm/work/word_counts > data/text/unigram_weights
# running at Tue Sep 13 04:15:18 2016
Traceback (most recent call last):
  File "/Users/zhouyangzhang/pocolm/scripts/get_unigram_weights.py", line 66, in <module>
    print(train_keys[0], 1.0)
TypeError: 'dict_keys' object does not support indexing
# exited with return code 0 after 0.1 seconds

( range(n) is not a list in python 3.x)

# get_counts.py --min-counts='' --max-memory=10G --limit-unk-history=false data/lm/work/int_20000 3 data/lm/work/counts_20000_3
# running at Tue Sep 13 04:23:07 2016
validate_vocab.py: validated file data/lm/work/int_20000/words.txt with 20000 entries.
get_counts.py: dumping counts
Traceback (most recent call last):
  File "/Users/zhouyangzhang/pocolm/scripts/get_counts.py", line 491, in <module>
    for n in [ "dev" ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
# exited with return code 0 after 0.3 seconds

(required integer like 10G not 10.0G)

# bash -c 'set -o pipefail; export LC_ALL=C; gunzip -c data/lm/work/int_20000/dev.txt.gz | get-text-counts 3 | sort --buffer-size=10.0G | uniq -c | get-int-counts /dev/null data/lm/work/counts_20000_3/int.dev.2 data/lm/work/counts_20000_3/int.dev.3'
# running at Tue Sep 13 04:25:36 2016
sort: invalid suffix in --buffer-size argument '10.0G'
get-int-counts: processed no data
# exited with return code 0 after 0.0 seconds

(install numpy on python3, use python3.4 -m pip install numpy)

(Same problem: range(n) is not a list)

validate_count_dir.py: validated counts directory data/lm/work/counts_20000_3_subset10
Traceback (most recent call last):
  File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 48, in <module>
    CleanupDir(args.count_dir, ngram_order, num_train_sets)
  File "/Users/zhouyangzhang/pocolm/scripts/cleanup_count_dir.py", line 25, in CleanupDir
    for n in [ 'dev' ] + range(1, num_train_sets + 1):
TypeError: can only concatenate list (not "range") to list
train_lm.py: failed to cleanup subset count dir: data/lm/work/counts_20000_3_subset10
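The sort failure on '10.0G' in the log above comes from a float being formatted into the --buffer-size argument. A sketch of the integer conversion described in this thread follows; format_buffer_size is a hypothetical helper name, not the function used in get_counts.py.

```python
def format_buffer_size(amount, unit):
    """Format a memory amount as an integer plus unit suffix for sort.

    The amount is truncated to an integer, so 10.0 becomes '10G' and
    1000.3 becomes '1000K'. Amounts below 1 would truncate to 0, so a
    caller dealing with small values should switch to a smaller unit.
    """
    return '{0}{1}'.format(int(amount), unit)


sort_option = '--buffer-size=' + format_buffer_size(10.0, 'G')
```

In python3 the '/' operator on two integers returns a float, which is one plausible way such a value ends up as 10.0 rather than 10; that explanation is an assumption, not something stated in the log.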