bartvm / nmt

Neural machine translation
MIT License
2 stars 2 forks source link

attentive encoder base #17

Closed JinseokNam closed 8 years ago

JinseokNam commented 8 years ago

This code base is related to the issue #12.

JinseokNam commented 8 years ago

I have squashed all changes regarding issue #12 into a single commit 4ee1df0db8031eb1df8a980ead5e7f1b07281621 and then picked it up.

bartvm commented 8 years ago

I'm afraid many changes conflict with the ones that I merged yesterady in #16 :( Let me see what commits can be cherry-picked without too much problems.

JinseokNam commented 8 years ago

Let me know if there are too many conflicts when merging this. It might be also possible to create new branch from the current master and then to apply incremental changes to the new branch based on what I wrote in unk_nmt.

I guess this option may take less time depending on the conflicts.

bartvm commented 8 years ago

I won't be able to merge everything, yeah, there's a few confusing commits (e.g. d83d9511597584c5303e6f534a38d4da84c8f6fe seems to redo a lot of changes from earlier commits, whereas 4a9641252efc59cf8e411cd6c294a83c55f4f954 and 23bfba0414116441a385cb81a9040a3d5588879c seem to be making the same chanegs). Anyway, see #18 for the few that I think can be merged easily.

JinseokNam commented 8 years ago

The files should have the same number of lines, so if ss == "" then tt == "" must be true as well, right? If not, then there's something wrong with the data.

Typically, the WMT datasets are quite noisy, so that there are many pairs of sentences where one language is empty but something is written in the other. Also, in the common crawl dataset, you may observe that sentences in different languages in the English language data file. For example, when opening up the common crawl English file, you can see Chinese, Korean and Japanese sentences. If we apply language detection on that dataset, it will get rid of at least 10% of sentences.

bartvm commented 8 years ago

If one of the lines is empty, ss and tt wouldn't be "", they would be "\n". The check you wrote (if ss == "") only checks if the end of the file has been reached, not if the line is empty.

Also, if there are empty lines, you should probably run Moses' cleaning script (see the README I wrote) which removes all sentence pairs which have an empty sentence. I'd rather make this removal part of the preprocessing pipeline than make it part of the iterator; less messy in my mind.

anirudh9119 commented 8 years ago

Github wont let me create a pull request, as it could not find any similar files.

But just to let you know, Currently in my branch, context_nmt

bartvm commented 8 years ago

Not sure why it won't let you make a pull request... Let me know if you need help fixing that.

Could you split those things up in two PRs though? Multi-GPU stuff in one, the unknown words in another, so that we can merge the multi-GPU stuff regardless of the rest.

anirudh9119 commented 8 years ago

Umm, Its giving the error

There isn't anything to compare. Nothing to compare, branches are entirely different commit histories.

as the current master and my branch has different naming of the files(probable reason)

bartvm commented 8 years ago

Ehm, okay... Yeah, it looks like your commit history is somehow entirely separate from the master branch :confounded: Not sure how that even happens, I've never seen this. :confused:

Honestly, fixing this nicely seems like it'll be pretty complicated. I just tried rebasing on master, but the number of merge conflicts is pretty humongous because it considers your commits different from the commits on the master branch somehow. It seems that instead of merging the new commits from the master branch, you made new commits that made the same changes (e.g. you copy-pasted files from the master branch to your branch?)

My best bet is that the easiest way of fixing this, sadly, is to just create a new branch from master, and then copy the files you need for multi-GPU stuff back in manually, commit, and then make a new pull request. The history should be okay that way, and the manual work is probably less troublesome than fixing the conflicts in a rebase, while simultaneously teasing apart the multi-GPU and context stuff.

anirudh9119 commented 8 years ago

Umm, I changed the names of the files from the master, to make it compatible with multi-gpu stuff, I did not copy pasted something.

But yeah, I think I should make a new branch, as you said, and then make the pull issue.

anirudh9119 commented 8 years ago

Also, I will only commit(for now) in seperate branches

bartvm commented 8 years ago

Did you change filenames using mv or git mv? Changing them using mv and then doing git rm and git add will make Git consider them to be entirely different files instead of renames, so maybe that's it?

anirudh9119 commented 8 years ago

Yes, I used mv.