freewym / espresso

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

[WIP] Lhotse/K2 example #45

Open freewym opened 3 years ago

freewym commented 3 years ago

@pzelasko

freewym commented 3 years ago

@pzelasko I just drafted a data prep script in examples/mobvoihotwords/local/data_prep.py. I'd like to double-check with you whether I did everything correctly and efficiently.

Basically, I want to augment the original training data with 1.1x/0.9x speed perturbation and with reverberation, separately, and then combine them into a single CutSet. I did that by first extracting the augmented features and dumping them to disk separately, and then merging their respective CutSets while modifying their ids (by prefixing) to differentiate utterances derived from the same underlying original one.

Also, I am not sure whether the way I did speed perturbation is correct (in terms of both the use of the pitch function and the value of the pitch shift being passed to it).

Thanks
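
For reference, a rough sketch of the "augment, prefix ids, then merge into one CutSet" flow described above, using Lhotse's CutSet methods (perturb_speed, modify_ids, from_cuts); the manifest paths are hypothetical and exact method names/arguments may differ across Lhotse versions:

    from itertools import chain
    from lhotse import CutSet

    cuts = CutSet.from_file("data/train/cuts.jsonl.gz")  # hypothetical manifest path

    # Speed-perturbed copies, with prefixed ids to disambiguate them from the originals
    cuts_sp11 = cuts.perturb_speed(1.1).modify_ids(lambda cid: f"sp1.1-{cid}")
    cuts_sp09 = cuts.perturb_speed(0.9).modify_ids(lambda cid: f"sp0.9-{cid}")

    # Features would be extracted for each subset separately (as described above);
    # here we only show merging the manifests with disambiguated ids.
    cuts_all = CutSet.from_cuts(chain(cuts, cuts_sp11, cuts_sp09))
    cuts_all.to_file("data/train/cuts_augmented.jsonl.gz")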

pzelasko commented 3 years ago

BTW this is in a very experimental stage, but some time ago I was able to run Lhotse feature extraction distributed on our CLSP grid with these steps (admittedly not tested with data augmentation yet):

  1. pip install dask distributed dask_jobqueue - Dask, a library that handles distributed computation in Python
  2. pip install git+https://github.com/pzelasko/plz - my wrapper around Dask dedicated to the CLSP grid
  3. from dask.distributed import Client
  4. from plz import setup_cluster
  5. with setup_cluster() as cluster, Client(cluster) as ex: <- a drop-in replacement for a process-pool executor
  6. cluster.scale(num_jobs)

If you'd like, you can try it; otherwise I will try it sometime, probably using your recipe, as it'll be a great testing ground for this.
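
A minimal sketch of how those steps could fit together with Lhotse's feature extraction, assuming CutSet.compute_and_store_features accepts an executor argument (argument names may differ across Lhotse versions, and the manifest path is hypothetical):

    from dask.distributed import Client
    from plz import setup_cluster
    from lhotse import CutSet, Fbank

    cuts = CutSet.from_file("data/train/cuts.jsonl.gz")
    num_jobs = 20

    with setup_cluster() as cluster, Client(cluster) as ex:
        cluster.scale(num_jobs)
        cuts = cuts.compute_and_store_features(
            extractor=Fbank(),
            storage_path="data/train/feats",
            num_jobs=num_jobs,
            executor=ex,  # drop-in replacement for a ProcessPoolExecutor
        )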

freewym commented 3 years ago

Thanks for the helpful comments! There are still additional data preprocessing steps to be done before feature extraction (adding additive noise and splitting the recordings). I will try the distributed extraction once they are done.

pzelasko commented 3 years ago

@freewym FYI the subprocess crashing issue seems to be a deep issue inside libsox and its use of OpenMP, so let's not get our hopes up that it will be resolved anytime soon. I suggest sticking to WavAugment for now... anyway, let's see what the torchaudio guys say about this.

freewym commented 3 years ago

@danpovey Please take a look at just the last commit in this PR related to LF-MMI training. I constructed all the phone HMMs (to be unioned together as H) and the phone LM manually. L.fst is constructed by taking optional silence into account. There are no disambiguation symbols at all. The HMMs are FSTs whose ilabels are pdf-ids and whose olabels are (mono-)phones. Then generate_graphs.py takes all of the above and produces the denominator graph (ilabels: pdf-ids; olabels: phone ids) and HL.fst (ilabels: pdf-ids; olabels: word ids). Finally, in k2_lf_mmi_loss.py the numerator graphs are created by composing HL.fst with the text (a single word).
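
A condensed sketch of that numerator construction, mirroring the create_numerator_graphs snippet quoted later in this thread; HL_inv, symbol_table and texts are placeholder names, and the exact k2 calls may differ between k2 versions:

    from typing import List
    import k2

    def make_num_graphs(texts: List[str], HL_inv: k2.Fsa, symbol_table: k2.SymbolTable) -> k2.Fsa:
        # One list of word ids per utterance (here typically a single word).
        word_ids = [[symbol_table.get(w) for w in t.split()] for t in texts]
        # Build an FsaVec of linear FSAs over the word ids.
        fsas = k2.linear_fsa(word_ids)
        # Compose with the inverted HL so that, after inverting back,
        # ilabels are pdf-ids and olabels are word ids.
        return k2.intersect(fsas, HL_inv).invert()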

There are a couple of things I am not clear about: 1) I didn't remove any self-loops when constructing any of the FSTs, including the HMMs. Do I still need to add self-loops to the numerator/denominator graphs? 2) In Kaldi we compose the denominator with the numerators to get a "normalized" loss, and before that the numerators' transition weights are all set to 1 to avoid double-counting the weights. But I am unable to figure out where this "weight clearance" is done in Kaldi. Do we also need to do it in this PR?

danpovey commented 3 years ago

oh yes sorry

On Sat, Nov 14, 2020 at 1:58 PM Yiming Wang notifications@github.com wrote:

@freewym commented on this pull request.

In espresso/criterions/k2_lf_mmi_loss.py https://github.com/freewym/espresso/pull/45#discussion_r523381700:

    num_graphs.to(encoder_out.device)
    num_graphs.scores.requires_grad_(False)
    num_graphs_unrolled = k2.intersect_dense_pruned(
        num_graphs, dense_fsa_vec, beam=100000, max_active_states=10000, min_active_states=0
    )
    num_scores = k2.get_tot_scores(num_graphs_unrolled, log_semiring=False, use_float_scores=True)

    # denominator computation
    self.den_graph.to(encoder_out.device)
    den_graph_unrolled = k2.intersect_dense_pruned(
        self.den_graph, dense_fsa_vec, beam=100000, max_active_states=10000, min_active_states=0
    )
    den_scores = k2.get_tot_scores(den_graph_unrolled, log_semiring=False, use_float_scores=True)

    # obtain the loss
    loss = -num_scores + den_scores

There is a minus sign before num_scores already. So we are maximizing num_scores


freewym commented 3 years ago

@danpovey I have resolved all the issues so far in the latest commit. Could you please look at run.sh and local/generate_graphs.py to see if they are correct? Also, the two issues I quoted below are still not clear to me:

@danpovey Please take a look at just the last commit in this PR related to LF-MMI training. I constructed all the phone HMMs (to be unioned together as H) and the phone LM manually. L.fst is constructed by taking optional silence into account. There are no disambiguation symbols at all. The HMMs are FSTs whose ilabels are pdf-ids and whose olabels are (mono-)phones. Then generate_graphs.py takes all of the above and produces the denominator graph (ilabels: pdf-ids; olabels: phone ids) and HL.fst (ilabels: pdf-ids; olabels: word ids). Finally, in k2_lf_mmi_loss.py the numerator graphs are created by composing HL.fst with the text (a single word).

I have something that I am not very clear:

  1. I didn't remove any self-loops when constructing any of the FSTs, including the HMMs. Do I still need to add self-loops to the numerator/denominator graphs?
  2. In Kaldi we compose the denominator with the numerators to get a "normalized" loss, and before that the numerators' transition weights are all set to 1 to avoid double-counting the weights. But I am unable to figure out where this "weight clearance" is done in Kaldi. Do we also need to do it in this PR?

danpovey commented 3 years ago

RE the self-loops: if they were there from the start, e.g. in H, it shouldn't be necessary to add any self-loops.

Regarding the normalized loss: let's not bother with that for now. We can just let it have either sign. Later on we can fix it differently, e.g. by having the LM not be normalized or including suitable cost terms when we construct the numerator FSAs.

freewym commented 3 years ago

@danpovey The Fst related files are generate_graphs.py, k2_lf_mmi_loss.py, and run.sh

danpovey commented 3 years ago

Thanks. Would you mind getting a Python stack trace from where the bug happened, using pdb?


danpovey commented 3 years ago

And also is it log or tropical semiring?


freewym commented 3 years ago

And also is it log or tropical semiring?

The only places that explicitly specify a semiring are the k2.get_tot_scores() calls when computing the loss. I guess the others all use the tropical one?

freewym commented 3 years ago


One observation: when I use k2.from_openfst() to load an FSA and then print(fsa), the sign of its weights seems to be reversed.

danpovey commented 3 years ago

Sign change is expected. Please show python stack trace

freewym commented 3 years ago

Sign change is expected. Please show python stack trace

I used python3 -m pdb train.py ..., but the program aborted out of the pdb environment without a stack trace. How can I get the stack trace?

The only message I got is /k2/k2/csrc/fsa_utils.cu:GetArcScores:1463 Check failed: num_states == forward_scores.Dim() (47096 vs. 14336)

danpovey commented 3 years ago

Maybe try this https://blog.richard.do/2018/03/18/how-to-debug-segmentation-fault-in-python/
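
The approach in that post likely boils down to Python's built-in faulthandler module, which prints the stack of every thread when the process dies on a fatal signal inside native code (a minimal sketch; equivalently, run with python3 -X faulthandler):

    # Put this near the top of train.py / speech_train.py.
    import faulthandler

    # Install handlers for SIGSEGV, SIGABRT, SIGBUS, SIGILL, SIGFPE so that
    # Python-level tracebacks are dumped even when the crash happens in C++/CUDA code.
    faulthandler.enable()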


freewym commented 3 years ago

[F] /home/ywang/fairseq4/espresso/tools/k2/k2/csrc/fsa_utils.cu:GetArcScores:1463 Check failed: num_states == forward_scores.Dim() (47096 vs. 14336)
Fatal Python error: Aborted

Current thread 0x00002ba95c097700 (most recent call first):
  File "/export/b03/ywang/anaconda3/lib/python3.8/site-packages/k2/autograd.py", line 90 in backward
  File "/export/b03/ywang/anaconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 89 in apply

Thread 0x00002ba8662bc140 (most recent call first):
  File "/export/b03/ywang/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130 in backward
  File "/export/b03/ywang/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 221 in backward
  File "/home/ywang/fairseq4/fairseq/optim/fairseq_optimizer.py", line 95 in backward
  File "/home/ywang/fairseq4/fairseq/tasks/fairseq_task.py", line 431 in train_step
  File "/home/ywang/fairseq4/fairseq/trainer.py", line 538 in train_step
  File "/export/b03/ywang/anaconda3/lib/python3.8/contextlib.py", line 75 in inner
  File "../../espresso/speech_train.py", line 227 in train
  File "/export/b03/ywang/anaconda3/lib/python3.8/contextlib.py", line 75 in inner
  File "../../espresso/speech_train.py", line 135 in main
  File "/home/ywang/fairseq4/fairseq/distributed_utils.py", line 334 in call_main
  File "../../espresso/speech_train.py", line 412 in cli_main
  File "../../espresso/speech_train.py", line 416 in

danpovey commented 3 years ago

I can't see the problem, but here's how I suggest to debug. First, please pull the latest k2 master, install locally from the bdist .whl file that you create locally (be sure to first pip uninstall the current version of k2), and make sure that you can still reproduce the problem. Then add debugging statements to _GetTotScoresFunction in autograd.py, e.g. printing out the sizes of tensors. Somehow it seems to be mixing up the FSAs and the things generated from them, so that the num_scores of one FSA are perhaps being applied to another FSA or something like that. Maybe something from the 1st minibatch is somehow being used for the second minibatch.

freewym commented 3 years ago

I may have found the cause (not 100% sure):

fsa = k2.linear_fsa([[2], [2], [2]])
print(fsa)

it gives an error: k2/k2/csrc/fsa_utils.cu:FsaToString:516 Check failed: fsa.NumAxes() == 2 (3 vs. 2)

maybe @csukuangfj or @danpovey would like to take a look

danpovey commented 3 years ago

I don't think that problem is related. We should figure out the problem there (e.g. it might be invalid input that should have been checked better) but likely not related.


danpovey commented 3 years ago

mm wait, you might be right...


qindazhu commented 3 years ago

fsa = k2.linear_fsa([[2],[2],[2]])

You are printing an FsaVec instead of an Fsa. It seems we don't currently support printing an FsaVec in Python.

You can do print(fsa[0]) to print each contained Fsa.
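
A tiny self-contained illustration of that workaround (the [[2], [2], [2]] input is the failing example from above):

    import k2

    fsa_vec = k2.linear_fsa([[2], [2], [2]])  # an FsaVec holding 3 linear FSAs
    for i in range(3):
        print(fsa_vec[i])  # printing each contained Fsa works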

danpovey commented 3 years ago

I think in fsa.py, where we do ans = "k2.Fsa: " + _fsa_to_str(self.arcs, False, aux_labels), there should be an if statement that does something like _fsa_vec_to_str otherwise. Haowen, can you please do that?
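
A hedged sketch of what that branch could look like; _fsa_vec_to_str is the hypothetical helper named above, and using num_axes() to distinguish an Fsa from an FsaVec is an assumption:

    def __str__(self) -> str:
        aux_labels = getattr(self, 'aux_labels', None)
        if self.arcs.num_axes() == 2:
            # single Fsa
            return "k2.Fsa: " + _fsa_to_str(self.arcs, False, aux_labels)
        # FsaVec (3 axes): hypothetical helper suggested above
        return "k2.FsaVec: " + _fsa_vec_to_str(self.arcs, False, aux_labels)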


qindazhu commented 3 years ago

Sure, will do

freewym commented 3 years ago

Maybe something from the 1st minibatch is somehow being used for the second minibatch.

Looks like the case:

tot_scores torch.Size([128])
tot_scores torch.Size([128])
forward_scores torch.Size([14336])
backward_scores torch.Size([14336])
arc_scores torch.Size([21120])
out_grad torch.Size([21120])
forward_scores torch.Size([39552])
backward_scores torch.Size([39552])
arc_scores torch.Size([71552])
out_grad torch.Size([71552])

tot_scores torch.Size([128])
tot_scores torch.Size([128])
forward_scores torch.Size([14336])
backward_scores torch.Size([14336])
[F] /home/ywang/fairseq4/espresso/tools/k2/k2/csrc/fsa_utils.cu:GetArcScores:1463 Check failed: num_states == forward_scores.Dim() (47096 vs. 14336)

In the end it terminated within this line: https://github.com/k2-fsa/k2/blob/c2c6edb634ff18a80630171259921d831328f1e0/k2/python/k2/autograd.py#L90

It looks like _GetTotScoresFunction.forward() and backward() are called twice on the 1st minibatch. The problem occurs in the backward of the 2nd minibatch.

danpovey commented 3 years ago

Try printing the address of fsa (via getting its base class). And print something out so you know where things are in the minibatch processing, e.g. after each update.
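
A small helper along those lines (illustrative only; the "address" is approximated with Python object ids):

    import k2

    def log_fsa_identity(tag: str, fsa: k2.Fsa) -> None:
        # Print the object id of the Fsa and of its underlying arcs so that
        # accidental reuse of the same object across minibatches shows up in the logs.
        print(f"{tag}: fsa id={hex(id(fsa))}, arcs id={hex(id(fsa.arcs))}")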


qindazhu commented 3 years ago

Added support for printing an FsaVec in this PR: https://github.com/k2-fsa/k2/pull/355

qindazhu commented 3 years ago

It looks like _GetTotScoresFunction.forward() and backward() are called twice on the 1st minibatch

I think this is correct, as you call forward and backward on both num_graph and den_graph, so for every batch there should be two forward and two backward calls (recall that your loss is num_graph - den_graph).

Not sure if this is related to the threads speech_train.py uses, but I think you can add more logs here, e.g. after intersect_dense_pruned (for both num_graph and den_graph). At that point you have already obtained the FsaVec, so you can print its sizes to see what the mismatch is. You may even want to print logs using K2_LOG(INFO) << num_fsas or K2_LOG(INFO) << num_states/num_arcs in GetForwardScore/GetBackwardScore/GetArcScores/GetTotScores in fsa_utils.cu.

qindazhu commented 3 years ago

Also, I think you can remove den_graph to see whether it still crashes; if it does not crash, it may be that something goes wrong when we do forward/backward on the same dense_fsa_vec (i.e. nnet_output, as both num_graph and den_graph intersect with it).

freewym commented 3 years ago

Also, I think you can remove den_graph to see whether it still crashes; if it does not crash, it may be that something goes wrong when we do forward/backward on the same dense_fsa_vec (i.e. nnet_output, as both num_graph and den_graph intersect with it).

@qindazhu There is no error when only computing num_graphs, but the same error occurs when only computing den_graph. So the problem probably comes solely from den_graph. den_graph was built from the static den_fst, which was loaded once in the constructor. After I changed it to be loaded in each forward pass instead, the error is gone. So apparently the cause is sharing the same den_fst instance across iterations.

qindazhu commented 3 years ago

Can you show me those tensors' sizes when just using den_graph (including den_graph's shape info)? It seems to me that den_graph changes across iterations:

[F] /home/ywang/fairseq4/espresso/tools/k2/k2/csrc/fsa_utils.cu:GetArcScores:1463 Check failed: num_states == forward_scores.Dim() (47096 vs. 14336)

freewym commented 3 years ago

forward_scores' size changes because it is obtained from den_graph_unrolled, which is the intersection of den_graph (not changing) and dense_fsa_vec (changing, as it comes from the network's output).

freewym commented 3 years ago

My confusion is: den_graph is not changing, yet it has to be created in every forward pass rather than once in the constructor.

danpovey commented 3 years ago

Definitely some kind of bug... likely something in the _grad_cache of the den_graph_unrolled object. Try adding debug code to things like get_forward_scores_float.


freewym commented 3 years ago

I don't understand what the underlying purpose is, but apparently at this line:

https://github.com/k2-fsa/k2/blob/c6d658ea71676e820e5fd883ff57ec5963acef19/k2/python/csrc/torch/fsa.cu#L198

if log_semiring is True, entering_arcs_tensor in the returned pair is not initialized (or just all 0's)?

danpovey commented 3 years ago

That's OK; we only need entering_arcs when the semiring is tropical. They will be ignored in the log case by the calling code.


danpovey commented 3 years ago

Around line 150 in autograd.py it does:

    for name, a_value in a_fsas.named_tensor_attr():
        if name == 'scores':
            continue
        value = _k2.index_select(a_value, arc_map_a)
        setattr(out_fsa[0], name, value)

    for name, a_value in a_fsas.named_non_tensor_attr():
        setattr(out_fsa[0], name, a_value)

.. can you please print out the named tensor and non-tensor attribute names? I suspect one of the non_tensor_attrs that shouldn't be there, e.g. _grad_cache, is mistakenly being set.


danpovey commented 3 years ago

... I suspect this line in fsa.py (line 197):

    if name in ('_tensor_attr', '_non_tensor_attr', 'arcs', '_properties'):

should read:

    if name in ('_tensor_attr', '_non_tensor_attr', 'arcs', '_properties', '_grad_cache'):

.. please make a PR to k2 if it fixes it.


freewym commented 3 years ago

Ok trying

danpovey commented 3 years ago

Do the num and/or den graphs have epsilons at this point? Can you describe the epsilons they have, if so?

On Tue, Dec 1, 2020 at 7:32 AM Yiming Wang notifications@github.com wrote:

@freewym commented on this pull request.

In espresso/criterions/k2_lf_mmi_loss.py https://github.com/freewym/espresso/pull/45#discussion_r532972215:

    def create_numerator_graphs(texts: List[str], HCL_fst_inv: k2.Fsa, symbols: k2.SymbolTable, den_graph=None):
        word_ids_list = []
        for text in texts:
            filtered_text = [
                word if word in symbols._sym2id else "" for word in text.split(" ")
            ]
            word_ids = [symbols.get(word) for word in filtered_text]
            word_ids_list.append(word_ids)

        fsa = k2.linear_fsa(word_ids_list)  # create an FsaVec from a list of list of word ids
        num_graphs = k2.intersect(fsa, HCL_fst_inv).invert()
        # TODO: normalize numerator
        if False:  # den_graph is not None:

@danpovey https://github.com/danpovey RE the normalized loss: the normalization happens within this if block, and the graphs HCL and den_graph passed into this function are loaded in the constructor below; they come from local/generate_graphs.py. The topo of the HMMs and the phone_lm are defined in "stage 1" of run.sh. The problem is: after the normalization, the numerator score > the denominator score.


freewym commented 3 years ago

Do the num and/or den graphs have epsilons at this point? Can you describe the epsilons they have, if so?

Both den/num have epsilons. The epsilons are from 1) epsilon self-loops in the HMMs, 2) an epsilon path in the phone LM representing a negative path, and 3) some epsilon arcs in L.fst.

danpovey commented 3 years ago

Mm. We have to be a bit careful about the meaning of epsilons. When they appear as self-loops in HMMs the meaning is not really epsilon, but "blank", I assume. (Meaning: they match with output 0 of the nnet). Such epsilons are not really supposed to coexist with epsilons that "mean" epsilon.

I'm not 100% sure what you mean by negative path in the phone-LM. Perhaps you mean backoff? Also the den is after composition; the epsilons in L.fst are mostly epsilons on the word side, no? Please verify exactly what epsilons are on each side of the (num * den) composition and why.

There are 2 modes in intersection: "treat_epsilons_specially" means epsilons will be treated as epsilon, for instance they will match an implicit epsilon self-loop on the other FSA. Otherwise they are treated as regular symbols.
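
A toy illustration of the two modes, assuming k2.intersect exposes the treat_epsilons_specially flag described above (the FSAs and labels here are made up; label 0 plays the role of epsilon):

    import k2

    a = k2.arc_sort(k2.Fsa.from_str(
        '0 1 0 0.0\n'
        '1 2 2 0.0\n'
        '2 3 -1 0.0\n'
        '3'
    ))
    b = k2.arc_sort(k2.Fsa.from_str(
        '0 1 2 0.0\n'
        '1 2 -1 0.0\n'
        '2'
    ))

    # Epsilon arcs (label 0) in `a` match implicit epsilon self-loops in `b`:
    c1 = k2.intersect(a, b, treat_epsilons_specially=True)

    # Epsilons are treated as ordinary symbols, so the epsilon arc in `a`
    # has no matching arc in `b` and the result may be empty:
    c2 = k2.intersect(a, b, treat_epsilons_specially=False)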

But our current composition/intersection code does not solve the "epsilon-sequencing problem". Meaning: if there are epsilons on both sides of what you are composing, you can get multiple paths corresponding to different orders of taking epsilon on each side. That may be the cause of your issue.

Dan


freewym commented 3 years ago

OK. In terms of the composition/intersection side of (num * den), as the labels there are "transition ids" to be matched with the nnet output, the epsilon arcs were added when creating the closure of H (so they are not self-loops).

Epsilons in L are on the phone-id side and are on the arcs start_state->loop_state and start_state->sil_state in the case where optional silence is allowed. But as this side of L is not involved in (num * den), it may be irrelevant. I also verified that there are no epsilons in the phone LM itself.

So if the cause is the "epsilon-sequencing problem", would removing epsilons in the closure of H help? Would the blank self-loops also be removed by doing so?

danpovey commented 3 years ago

@freewym can you clarify that you have reserved index zero in the nnet output, so that epsilons can be treated specially in the graphs? I.e. that epsilon is not also a valid pdf-id?

freewym commented 3 years ago

This is what I am going to do: before applying FSA operations that may affect epsilons or be affected by them (e.g. remove_epsilon, intersect), the fsa.labels tensor is temporarily incremented by 1 so that the blank label (index 0 in the HMM) is treated as if it were a normal label; after those operations, fsa.labels is restored.
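
A rough sketch of that label-shifting trick, assuming fsa.labels can be read and assigned as a torch tensor (as in recent k2 versions); the toy FSA is only for illustration:

    import k2

    def shift_labels(fsa: k2.Fsa, offset: int) -> k2.Fsa:
        # Shift all real labels by `offset`, leaving final arcs (label -1) alone.
        labels = fsa.labels.clone()
        labels[labels >= 0] += offset
        fsa.labels = labels
        return fsa

    fsa = k2.linear_fsa([0, 2, 3])   # toy FSA; label 0 means "blank" here, not epsilon

    fsa = shift_labels(fsa, +1)      # blank temporarily becomes an ordinary label
    fsa = k2.remove_epsilon(fsa)     # epsilon-sensitive operation(s)
    fsa = shift_labels(fsa, -1)      # restore the original label values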

freewym commented 3 years ago

@danpovey I got a preliminary result, which is ~15% EER. It is still high, but at least it seems to be starting to work. I will continue trying to improve it. In the meantime, if you have time, could you please take a look at how the training graph is created, starting from

https://github.com/freewym/espresso/blob/1b059663d90d45a46b91e10c17530e498ec0e9a0/espresso/criterions/k2_lf_mmi_loss.py#L95

to spot any problems?

danpovey commented 3 years ago

Is the objf the right sign now?

freewym commented 3 years ago

Is the objf the right sign now?

Yes

danpovey commented 3 years ago

You shouldn't need the 'clamp' thing, BTW.

danpovey commented 3 years ago

Also, check that the sum of the output of the network averages close to zero; certain bugs might cause it to be biased in a positive or negative direction. [edit: this assumes there is no log-softmax]
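
A quick way to check that, with a hypothetical stand-in for the network output tensor:

    import torch

    # Hypothetical stand-in for the network output: (batch, time, num_pdfs).
    nnet_output = torch.randn(8, 100, 10)

    # With no log-softmax on top, the per-frame sums should average near zero;
    # a strong positive or negative bias can hint at the kind of bug mentioned above.
    print("mean per-frame output sum:", nnet_output.sum(dim=-1).mean().item())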