alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

Using MongoDB and how training works #4

Open xuewyang opened 4 years ago

xuewyang commented 4 years ago

Hi Alasdair, Thank you for your great work. I have two questions. Hope you can help me out.

  1. How does MongoDB work? I have never used it and don't know how it works. Also, I think it is not free, right? Can we still use it without restriction?
  2. I ran the script for training, but training does not begin. Here is some of the log:

CUDA_VISIBLE_DEVICES=1,0 tell train expt/goodnews/1_lstm_glove/config.yaml -f
INFO Beginning training.
INFO Epoch 0/99
INFO Peak CPU memory usage MB: 5925.988
INFO GPU 0 memory usage MB: 429
INFO GPU 1 memory usage MB: 2743
INFO Training 0%| | 0/32768 [00:00<?, ?it/s]
INFO Grabbing all article IDs 424692it [00:01, 397156.04it/s]
INFO Grabbing all article IDs 424692it [00:01, 393577.69it/s]
(the "Grabbing all article IDs" line then repeats dozens more times at similar rates)
385594it [00:01, 1386.63it/s]
xuewyang commented 4 years ago

I also restored all the data successfully with:

mongorestore --db goodnews --host=localhost --port=27017 --drop --gzip --archive=goodnews-2020-04-21

2020-06-25T22:21:40.146-0400 576180 document(s) restored successfully. 0 document(s) failed to restore

xuewyang commented 4 years ago

Oh, my bad. I didn't set up the right image folder. Please help me with the MongoDB question. Thank you.

alasdairtran commented 4 years ago

Yep, you're right about the image folder. Those repeated log messages come from here. Basically, in the for-loop below it, if the program can't find any images, it will just keep skipping to the next iteration, and you're stuck in an infinite loop! Make sure you extract the images to data/goodnews/images_processed or update the image path in the config file.
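The failure mode looks roughly like this (a hypothetical sketch, not the actual repo code — names like iter_valid_samples are mine): samples whose image file is missing get silently skipped, so with a wrong image directory the loader yields nothing and the training loop spins forever re-requesting batches.

```python
import os


def iter_valid_samples(article_ids, image_dir):
    """Yield only the IDs whose image file exists on disk."""
    for aid in article_ids:
        path = os.path.join(image_dir, f"{aid}.jpg")
        if not os.path.exists(path):
            continue  # silently skip samples with missing images
        yield aid


# With a wrong image_dir, this generator is empty, so a training loop that
# keeps asking for the next batch never makes progress.
print(list(iter_valid_samples(["a1", "a2"], "/wrong/path")))
```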

MongoDB Community Edition is free. We don't need the advanced features in the paid enterprise version since we're not deploying it in a production environment. I think you have installed it correctly, since you've managed to restore the dump successfully.

In the README file under Getting Data, I provided some sample Python code showing how to interact with the database using the pymongo library: retrieving an article, getting the corresponding image, and so on. You can use that code as a starting point. Maybe copy and paste it into a Jupyter notebook and print out the results to see how the data are stored and represented.
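A minimal pymongo sketch along those lines (the collection and field names below are illustrative guesses, not necessarily the dump's actual schema — check the README sample for the real one):

```python
def build_article_query(article_id):
    """Query document for fetching one article by its _id."""
    return {"_id": article_id}


try:
    from pymongo import MongoClient

    # Short timeout so this fails fast when no local mongod is running.
    client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=2000)
    db = client["goodnews"]
    # "articles" and "headline" are assumed names for illustration only.
    article = db["articles"].find_one(build_article_query("some-article-id"))
    if article is not None:
        print(article.get("headline"))
except Exception:
    # pymongo not installed or no local mongod; the query builder still works.
    pass
```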

If you just want to replicate the experiments, the only two commands you need are starting the mongo database and restoring the dump (both of which you have successfully done).

xuewyang commented 4 years ago

Yes, it is training successfully now. Can you please also explain how to train without installing 'tell'? I am working for someone else, and even though I can run your script, I cannot install packages on their servers. Can you please tell me what 'train', 'dataloader', 'model', 'evaluate', etc. are, and which scripts are involved in each part? Thank you. This will also help me implement my own model, because I cannot install 'tell' for it.

xuewyang commented 4 years ago

I made it work, though it took me some time. Thank you.

alasdairtran commented 4 years ago

When you call

tell train expt/goodnews/1_lstm_glove/config.yaml -f

that's really just a shortcut for

python -m tell.commands train expt/goodnews/1_lstm_glove/config.yaml -f

so the entry point of the program is here.
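For reference, this kind of shortcut is typically wired up as a setuptools console_scripts entry point. A sketch of that wiring (the target "tell.commands:main" is my assumption about how this repo declares it, not a verified quote from its setup.py):

```python
# Sketch of the setuptools entry-point mapping that makes `tell train ...`
# equivalent to `python -m tell.commands train ...`. The exact target string
# in this repo may differ; this shows the general mechanism only.
ENTRY_POINTS = {
    "console_scripts": [
        # "<command> = <module>:<function>"
        "tell = tell.commands:main",
    ],
}

print(ENTRY_POINTS["console_scripts"])
```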

You still need to install the tell package because that's the easiest way for the files in different directories to reference each other. If you can't install tell, you need to turn all the import tell lines into relative imports, which would look quite messy.

A lot of the programming patterns used in this repo are based on the AllenNLP library. In particular, when you run tell train, it calls AllenNLP's train_model function here, which will eventually make use of the trainer here.

All the dataloaders sit in this directory and all the models sit in this directory. Which data loader or model the program will use depends on what you specify in the config.yaml file.
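The config-driven selection works through AllenNLP's registry pattern: classes register themselves under a name, and the config file picks one by that name. A stripped-down sketch of the idea (class and registration names here are illustrative, not the repo's actual ones):

```python
class Registrable:
    """Base class keeping a per-subclass-hierarchy name -> class registry."""

    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(subclass):
            cls._registry.setdefault(cls, {})[name] = subclass
            return subclass
        return decorator

    @classmethod
    def by_name(cls, name):
        return cls._registry[cls][name]


class Model(Registrable):
    pass


@Model.register("lstm_glove")
class LstmGloveModel(Model):
    pass


# A config.yaml entry like `model: {type: lstm_glove}` would then be resolved
# by the trainer via Model.by_name("lstm_glove").
print(Model.by_name("lstm_glove").__name__)
```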

I suggest having a quick look at the AllenNLP tutorials if you want to see how everything (config file, trainer, dataloader, etc) ties together.

xuewyang commented 4 years ago

Thank you, very helpful. Will let you know if I have more questions.

alasdairtran commented 4 years ago

Do you use Anaconda Python? I think you can install mongo without admin access in Anaconda:

conda install -c anaconda mongodb

Otherwise I can give you a JSON dump if you want.

On Wed, 1 Jul 2020 at 14:19, Xuewen Yang notifications@github.com wrote:

Hi Alasdair, if I don't use MongoDB, can I find a way to get the processed data? My machine has 120 GB of memory; maybe that is enough. The reason I want to do without MongoDB is that I have to ask the admin to install it, which might not be approved, but I will try to persuade him.


xuewyang commented 4 years ago

Yes, it works. I didn't know that I could use conda to install it. Thank you. I used https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/ before.

I installed it via conda. But now when I run

mongorestore --db goodnews --host=localhost --port=27017 --drop --gzip --archive=data/mongobackups/goodnews-2020-04-21

it still asks me to apt-install things. If possible, please give me a JSON dump so that I won't need Docker. My email is xuewen.yang@stonybrook.edu

Thank you.

alasdairtran commented 4 years ago

Ok sure. I'll extract and upload the JSON dump. It might take a few days.


xuewyang commented 4 years ago

Hi, please don't bother if it takes that long. I can figure this out. The admin is helping me, so I think I can make it work.

alasdairtran commented 4 years ago

Cool, good luck with setting up mongo then. The command to extract the JSON dump is easy, but it would probably take two or three days to upload the data to the cloud.


xuewyang commented 4 years ago

Can I bother you to upload the JSON dumps after all? I can't manage to set up mongo. Thank you.

xuewyang commented 4 years ago

I solved the problem with the following commands. Thanks.

docker run -it --rm -v /home/xyang/captioning-xuewen/goodnews:/xuewen -v /home/xyang/mongodb:/data/db --network=host -p 27017:27017 --name news1 mongo

mongorestore --db goodnews --host=localhost --port=27017 --drop --gzip --archive=goodnews-2020-04-21