Training the system with different data

j6mes commented 5 years ago

Hi, is it possible to retrain the system with different data? Which scripts do I need to run to do this? There seem to be a lot of Python files and I'm not sure which ones to call.

MichalPitr commented 4 years ago

@j6mes Have you ever figured out how to retrain the system? I'm trying to get it working on Czech wiki, but it's very unclear how to move forward.

j6mes commented 4 years ago

No, I never needed to go through and retrain the entire system. For me, I got the best value out of just putting it into a Docker image and calling it as a black box.

Perhaps @easonnie could advise on how to retrain the system?

MichalPitr commented 4 years ago

@j6mes Thanks for the reply. I have played around with your fork quite a lot, so thanks for the cleaned-up version. Hopefully @easonnie finds the time to advise on retraining.

ShyamSubramanian commented 4 years ago

@MichalPitr I had previously experimented with training their sentence retrieval and verification models. I do not have a compact version of the training code at the moment. I will just give you some quick steps and I think it is somewhat easy to figure out the rest.

1. Make sure you can run inference by following their README, since this ensures that all the required installations are in place.
2. Use auto_pipeline.py to produce the output of the document retrieval step for both the training and dev datasets by setting the default_steps variable appropriately. The steps to execute run from s1.tokenizing to s2.2.1doc_nn_retri. Use the output files for the rest of the training steps.
3. For sentence retrieval training, use the method train_fever_v1 from src/sentence_retrieval/simple_nnmodel.py.
4. For claim verification training, use the method train_fever_v1_advsample from src/nli/mesim_wn_simi_v1_2.py.
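
As a rough sketch of the steps above (not tooling from the repository): the helper names steps_between and resolve are hypothetical, the dotted module paths assume that the repository's src/ directory is on sys.path, and the full list of pipeline step names has to be read from auto_pipeline.py, since only the first and last names are given here. The sketch only locates the two training functions; it does not call them, because their arguments are not described in this thread.

```python
import importlib
from typing import Callable, List, Tuple

# Training entry points named in the steps above, as (module, function).
# ASSUMPTION: the dotted module paths presume the repository's src/
# directory is on sys.path; adjust them if your checkout differs.
TRAINING_ENTRY_POINTS: List[Tuple[str, str]] = [
    ("sentence_retrieval.simple_nnmodel", "train_fever_v1"),
    ("nli.mesim_wn_simi_v1_2", "train_fever_v1_advsample"),
]


def steps_between(all_steps: List[str], first: str, last: str) -> List[str]:
    """Return the contiguous slice of all_steps from first to last, inclusive.

    Hypothetical helper: all_steps must be the ordered step names taken
    from auto_pipeline.py; this function only selects a range from them.
    """
    start = all_steps.index(first)
    end = all_steps.index(last)
    if end < start:
        raise ValueError(f"{last!r} comes before {first!r}")
    return all_steps[start:end + 1]


def resolve(module_name: str, function_name: str) -> Callable:
    """Import module_name and return the attribute function_name from it."""
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


if __name__ == "__main__":
    # Only check that the training functions can be found; do not call them.
    for module_name, function_name in TRAINING_ENTRY_POINTS:
        try:
            fn = resolve(module_name, function_name)
            print(f"found {module_name}.{function_name}: {fn}")
        except (ImportError, AttributeError) as exc:
            print(f"could not resolve {module_name}.{function_name}: {exc}")
```

The list returned by steps_between(all_steps, "s1.tokenizing", "s2.2.1doc_nn_retri") is the kind of value one would expect to assign to default_steps, though the exact form that variable takes should be checked in auto_pipeline.py.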

Let me know if you get stuck somewhere!

MichalPitr commented 4 years ago

@ShyamSubramanian Thanks, that's really useful. I am especially interested in getting the document retrieval working on my Czech wiki database, but the auto_pipeline.py uses a file id_dict.json that I haven't figured out how to generate using the code.

easonnie / combine-FEVER-NSMN
