OpenCCG / openccg

OpenCCG library for parsing and realization with CCG
http://openccg.sourceforge.net/

add Dockerfile for automated build #24

Open lpmi-13 opened 5 years ago

lpmi-13 commented 5 years ago

I'm not exactly sure why, but when I tried to use line continuation to put all the ENV variable declarations on one line, everything blew up...so leaving it as multiple layers for now. I've also not had much luck building things in the smaller alpine containers, so that's why I went with the bigger (but more robust) ubuntu 16.04 base image.
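For reference, a single ENV instruction can carry several key=value pairs across continued lines, which keeps them in one layer; a sketch (the names and values are placeholders, not the PR's actual settings):

```dockerfile
# Several variables in one ENV instruction = one layer; the backslashes
# continue the same instruction. (Placeholder names/values.)
ENV OPENCCG_HOME=/openccg \
    PATH=/openccg/bin:$PATH
```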

Ideally, we would have a Dockerfile that features a clean build with everything in source control, but if that's not feasible, then I suggest we might use this to put something in a publicly accessible docker hub image, to be used with docker-compose, and then it wouldn't have to constantly pull in dependencies from sourceforge.

I also only did cursory testing on this (ie, with the tccg command in /grammars/tiny), so there might be something I missed during the compilation process, but happy to update this PR in that event.

Lastly, if it seems sensible to add in a short section in the README about installing docker, running it as non-root (at least in linux), and using it to build and run the image, I'm happy to add that as part of this PR as well.

lpmi-13 commented 5 years ago

Have updated the README with some basic install and usage instructions for docker, as well as grabbing most of the jar files from the maven repository rather than sourceforge. I'm still not entirely sure which functionality is missing (didn't bother building KenLM or hunting down NERApp.jar, but can take another look if necessary). Tested with the instructions from the README and everything seemed to have built alright though.

Happy to either add on to this PR or open a new one if there is additional functionality that needs more build steps.

mwhite14850 commented 5 years ago

I gather you haven't tried any of the steps from docs/ccgbank-README? (Looks like that's not linked from the main README, which it should be.)


lpmi-13 commented 5 years ago

Ah right, that and the taggers README are just what I'd been missing!

Will give them a read and update this PR when they're integrated into the docker build.

mwhite14850 commented 5 years ago

Cool beans!


lpmi-13 commented 5 years ago

I'm a bit stuck at the following section from the ccgbank-README:

Since the pre-built English models and CCGbank data for training represent much larger downloads than the OpenCCG core files, they are available as separate downloads (where YYYY-MM-DD represents the date of creation):

english-models.YYYY-MM-DD.tgz
ccgbank-data.YYYY-MM-DD.tgz


I wasn't able to find these in either the openccg project or anywhere on the ccgBank site. Would you be able to provide a bit of guidance?

dmhowcroft commented 5 years ago

Ah, there's no Git LFS or similar solution set up for the data files, so they're still hosted on Source Forge: https://sourceforge.net/projects/openccg/files/data/

I don't think this is mentioned explicitly in the README; there's just the pointer to get the libs from SF.

lpmi-13 commented 5 years ago

got it...if those two archives don't tend to change much (looks like 2013 was the last update?), then any objections to just storing the uncompressed versions in the GitHub source?

dmhowcroft commented 5 years ago

I think we should at least keep them out of the main branch on GitHub because not everyone needs them. 90 MB might not be much by today's standards (with terribly wasteful Electron apps all over the place), but I would rather handle it separately in the Dockerfile and let folks who just need the git repo avoid downloading it altogether.

lpmi-13 commented 5 years ago

ah, fair point. I guess we might want to decide whether the Dockerfile should target just the bare-bones minimum functionality, then?

I'm sure it's my lack of domain knowledge here, but I wasn't really able to figure out from the README what the basic functions of the project are (eg, as a subset of the complete set of functions). Would it be possible to specify the default functions (and ideally with examples of the commands and expected outputs) we definitely want in a docker container, and then I can aim to target that?

What the README suggests to me is that we have three basic functions that we would expect in any minimally functional installation (specified in the "Trying it out", "Visualizing semantic graphs", and "Creating disjunctive logical forms" sections), though do please correct me if that's not the case.

Alternately, if we would prefer to have a bit more of the functionality, including the parsing and tagging (as specified in docs/taggers-README), I'm happy to attempt to add that in as well, since solving the installation/configuration issues once in the docker container would make it scalable for future use by other researchers.

mwhite14850 commented 5 years ago

Hi Adam

I would say there are two different kinds of users, namely (1) ones interested in using or creating precise, domain-specific grammars and (2) ones interested in using the broad coverage English grammar for parsing and (especially) realization.

I would agree that the first group of users would appreciate not having to download large model files that they don't need. The second group of users would generally also like to have the basic functionality in "Trying it out" and "Visualizing semantic graphs". (I'm not sure how much "Creating disjunctive logical forms" is getting used.)

Not sure what this means for the Dockerfile though: is it possible to have two, or one with options?

Mike


lpmi-13 commented 5 years ago

Hi Mike,

I appreciate your patience with this whole process. I'm very aware of my lack of context on this project, and that's probably leading to a lot of questions that wouldn't otherwise arise. I'm definitely focused on creating something that's useful for you and the project users, so I'm very happy for you and the other maintainers to drive this.

In terms of options for Docker implementations, we could technically add a second Dockerfile for a non-default container, though that's a bit non-standard (and so perhaps not immediately obvious to users that the option exists).

What might be easier is to set up the build so that if the user wants to use those extra models, they would be able to download those locally and have them automatically mounted into the default container at runtime. One of the advantages of this is it would just involve passing additional parameters in the command rather than any difference in the Dockerfile per se.
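Sketched concretely (the image tag and mount paths here are assumptions, just to illustrate the runtime-mount idea):

```shell
# Build the image once without models baked in, then optionally bind-mount
# a locally downloaded models directory when starting the container.
docker build -t openccg .
docker run -it --rm \
    -v "$PWD/english-models:/openccg/models:ro" \
    openccg
```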

And then in terms of end-usage, I'm assuming that the results of most of the commands involve writing things to files (rather than, say, just outputting results of exploratory data analysis to the console)? This would primarily have implications for whether users would need to enter the running container vs. just firing commands at it (the former being slightly more complex), but in any case it would be possible to set up the final container to take input and return output...just trying to think through what the final usable solution looks like.

lpmi-13 commented 5 years ago

So I've added a conditional script in the Dockerfile to deal with the english-models.YYYY-MM-DD.tgz file if it exists, and skip it if it does not.

In addition, I've now gotten the following commands to complete successfully:

tccg

ccg-draw-graph -i tb.xml -v graphs/g

ccg-build -f build-ps.xml test-novel &> logs/log.ps.test.novel &

ccg-build -f build-rz.xml test-novel &> logs/log.rz.test.novel &

...though this is where I'm stuck currently:

  • Building English models from the CCGBank: "You'll also need to create a symbolic link to your original CCGbank directory from $OPENCCG_HOME/ccgbank/." (What would the original CCGbank directory be? I'm unable to find anything in the system that looks like ccgbank1.1.)
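A minimal sketch of such a conditional step (the helper name, messages, and the assumption that the archive sits in the build context are mine, not the PR's exact script):

```shell
# Extract the optional English models archive if one was copied into the
# build context; otherwise continue without it. (Helper name and messages
# are illustrative.)
extract_models_if_present() {
    archive=$(ls english-models.*.tgz 2>/dev/null | head -n 1)
    if [ -n "$archive" ]; then
        echo "extracting $archive"
        tar xzf "$archive"
    else
        echo "no models archive found; skipping"
    fi
}
extract_models_if_present
```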

mwhite14850 commented 5 years ago

Hi Adam

The CCGbank is licensed by the LDC and can only be obtained from them directly, that’s why this part is set up the way it is.

Perhaps it would make sense to just skip this and document the reason why? I would say that only the most expert users would be likely to want to do this step anyway.

Thanks Mike


lpmi-13 commented 5 years ago

Hi Mike,

Alright, I think we're almost there. I've added a comment as per your suggestion into the README documentation and skipped building the English models for now.

Just to confirm, are the POS and supertaggers intended for normal use? If so, I can go ahead and get those working in the docker container as well, but wanted to check with you first just in case those are a similar feature like the CCGBank English model building and shouldn't be included in the default container.

Thanks, Adam

mwhite14850 commented 5 years ago

Thanks for the update!

Yes, all the taggers are for normal use.

Mike


lpmi-13 commented 5 years ago

I've been able to compile both the maxent toolkit (from source on GitHub) and the SRILM package (available from one of the Google Code archives... version 1.6.0, but it seems like it might work).

The current sticking point is that when I attempt to run the command ccg-build -f build-original.xml &> logs/log.original &, the training fails with the following log output:

Buildfile: /openccg/ccgbank/build-original.xml

init:

make-corpus-splits:
     [echo] Making corpus splits in ./original/corpus

BUILD FAILED
/openccg/ccgbank/build-original.xml:46: /openccg/ccgbank/ccgbank1.1/data/AUTO does not exist.

Total time: 0 seconds

...which leads me to believe that it might be related to the same issue as previously, where since I don't have the CCGBANK data, it fails. Any thoughts?

mwhite14850 commented 5 years ago

Yes, I'm sure that's the same issue.

If you have an LDC license and have or can get the CCGbank, you could test this out. Otherwise the tests for parsing and generating novel text with the existing models is as far as I'd expect you to be able to get.

Note that the maxent toolkit and SRILM are primarily for training models from scratch -- in principle the JNI code for using SRILM as a runtime language model could also be used, but hasn't been anytime recently, as it's mostly superseded by KenLM.


lpmi-13 commented 5 years ago

ah, got it. I don't happen to have any access to the CCGbank, but perhaps you might know a user who would be interested in testing out the docker container build?

In any event, does your note about the maxent and SRILM mean that we wouldn't necessarily want them in the default docker build? I'm assuming the KenLM download, as massive as it is, is also not something we'd want in the default docker build.

I'm happy to remove the steps for adding/compiling the maxent/SRILM stuff if that seems appropriate, and do please let me know if anything else would be necessary to finish this pull request for the docker automated build.

Thanks, Adam

mwhite14850 commented 5 years ago

Yes, maybe comment out the bits for maxent and SRILM.

Have you been able to test that using gigaword4.5g.kenlm.bin works if it's there? The log files from running (parsing and) realization on novel text should be slightly different if it's in the expected location.

Perhaps Dave Howcroft could try out the docker container build ...


lpmi-13 commented 5 years ago

I tried parsing and realizing using gigaword4.5g.kenlm.bin, and it seems to have completed successfully, though I'm not sure if that was because it correctly picked it up, or if it fell back to the default trigram model. Would there be something in particular in the log files that might indicate that?

I was also curious about the line in the README to set the LD_LIBRARY_PATH: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OPENCCG_HOME/lib

It looked like $LD_LIBRARY_PATH wasn't resolving to anything, since it hadn't been set yet, so I attempted to just set it to $OPENCCG_HOME/lib like so: export LD_LIBRARY_PATH=$OPENCCG_HOME/lib

Was there a previous step where $LD_LIBRARY_PATH had been set...or should that be a different variable?

dmhowcroft commented 5 years ago

LD_LIBRARY_PATH is a standard linux environment variable that is usually empty but can contain a list of places to look for libraries before searching in the standard places. (More here)

In Linux, the environment variable LD_LIBRARY_PATH is a colon-separated set of directories where libraries should be searched for first, before the standard set of directories; this is useful when debugging a new library or using a nonstandard library for special purposes.

So I don't think it's unusual for it to be unset right now.
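So the README's form is just the conventional append idiom. If the leading colon from an unset variable bothers anyone, a guarded variant avoids it (sketch; the path is illustrative):

```shell
# Append $OPENCCG_HOME/lib to LD_LIBRARY_PATH, adding the separating colon
# only when the variable already has a value.
export OPENCCG_HOME=/openccg
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$OPENCCG_HOME/lib"
```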

Per @mwhite14850's suggestion, I think I can test the docker image, but I'm very busy for the coming weeks so I can't guarantee a particular time for testing. If I find the time, I'll update this thread that I'm working on it, and otherwise I'll try to get back to it sometime in June.

mwhite14850 commented 5 years ago

If it can't find the big LM, the log file should contain this message: "Reusing trigram model as a stand-in for the big LM"

If that message isn't there, that should mean that it found the big LM successfully; to test this, just temporarily move or rename the gigaword4.5g.kenlm.bin file and see if this message appears when running again.

This message is in the file ccgbank/plugins/MyNgramCombo.java, one of a set of plugin classes used to do flexible configuration.
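A quick way to check a log for that marker (the helper name and log path are illustrative; only the quoted message comes from the plugin source):

```shell
# Report whether realization fell back to the trigram model, based on the
# marker message from ccgbank/plugins/MyNgramCombo.java.
check_big_lm() {
    if grep -q "Reusing trigram model as a stand-in for the big LM" "$1" 2>/dev/null; then
        echo "fell back to trigram model"
    else
        echo "no fallback message: big LM was found (or log is missing)"
    fi
}
check_big_lm logs/log.rz.test.novel
```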


lpmi-13 commented 5 years ago

ah, thanks for the pointer @dmhowcroft ...I'm a bit embarrassed to admit I've never encountered the LD_LIBRARY_PATH in my adventures through linux land. Good to know!

I've been examining the log files, though now I can't seem to get the parser to work, which is strange, since it was working earlier with the same Dockerfile. At any rate, no rush to test this, since I need to do some investigation into this anyway.

lpmi-13 commented 5 years ago

alright, so I've confirmed that the gigaword4.5g.kenlm.bin file works if present. The lines you mentioned were only in the logs when I removed the file. Additionally, I've pushed the version of the Dockerfile that works for all the basic functionality, along with lines to get the maxent toolkit and SRILM working, which are commented out for now (plus a line in the README detailing the same).

Again, this could do with a full test to see if I've set it up correctly. Version 1.6 of SRILM was the only one I could find freely available (and possible to pull in via Dockerfile).

No rush on this. Feel free to test when convenient and we can move forward from there.

dmhowcroft commented 5 years ago

I just tried to install on my laptop running Fedora 30 and encountered the following error at the end of the Docker build process:

Error: Could not find or load main class org.apache.tools.ant.launch.Launcher
The command '/bin/sh -c ./models.sh &&     mvn dependency:copy-dependencies -DoutputDirectory='./lib' &&     mv lib/stanford-corenlp-3.9.2.jar ccgbank/stanford-nlp/stanford-core-nlp.jar &&     jar xf lib/stanford-corenlp-3.9.2-models.jar &&     cp edu/stanford/nlp/models/ner/* ccgbank/stanford-nlp/classifiers/ &&     rm -rf edu &&     ccg-build' returned a non-zero code: 1

Steps to reproduce:

  1. Install Docker and start it

    sudo dnf install docker
    sudo systemctl start docker
  2. Install the repo and checkout the right branch

    git clone https://github.com/lpmi-13/openccg.git
    cd openccg
    git checkout dockerize
  3. Run the Docker build process

    sudo docker build .
mwhite14850 commented 5 years ago

I was going to go ahead and accept the maven pull request, does this issue affect that?

Thanks!


lpmi-13 commented 5 years ago

The maven pull request can go in whenever you're ready, and I'll take a look at reproducing this error in the docker build today or tomorrow

mwhite14850 commented 5 years ago

Ok, great, just did the merge (as noted on the pull request thread). Thanks!


lpmi-13 commented 5 years ago

The failure is due to javacc, which may somehow have gotten out of sync. I notice it's also included in the master branch now via the merge two days ago, so I'll attempt to update using that instead and we can see where we are.