Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Loading data into local development environment #2010

IsaaacD opened this issue 4 years ago (status: Open)

IsaaacD commented 4 years ago

Hi, I'm having trouble loading data into my local environment.

I attempted

sudo mysql tatoeba < Tatoeba/docs/database/import/restore_dumps.sql

but it timed out without doing much.

I was able to successfully run each statement with the corresponding CSV by entering:

TRUNCATE TABLE sentences;
LOAD DATA INFILE "/tmp/sentences.csv" INTO TABLE sentences FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' (id, lang, text);

and so on for each table. But when I then load a page with data, like a random sentence, I get the message "An error occurred while fetching random sentences. If this persists, please let us know.".
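(For reference, the repeated per-table step could be scripted like the sketch below. It's only a sketch: the table names other than sentences are illustrative, each dump is assumed to sit at /tmp/<table>.csv, and in practice the column list varies per table, e.g. sentences needed the explicit (id, lang, text) above.)

for table in links tags; do
    sudo mysql tatoeba -e "
        TRUNCATE TABLE $table;
        LOAD DATA INFILE '/tmp/$table.csv' INTO TABLE $table
        FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';"
done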

I'm wondering if anyone else has experienced this type of error, or if there's more documentation I could be looking at to get my development environment set up.

Yorwba commented 4 years ago

For testing, I just create my own sentences manually, so I can't really help with restoring from the database dumps.

However, I've encountered the message "An error occurred while fetching random sentences. If this persists, please let us know." before. It indicates that the search daemon is running (otherwise you'd get a much scarier error message) but doesn't have any sentences indexed to fetch randomly.

Either you do not have any sentences yet (check "Browse by language"; it works without the search index) or they haven't been indexed yet.

To refresh the index, I usually SSH into the VM with

vagrant ssh

and then

cd ~/Tatoeba
sudo systemctl stop manticore; sudo bin/cake sphinx_indexes update main; sudo systemctl start manticore

Since I do not need many sentences for testing, this is reasonably fast.

To monitor the search daemon during that process, you can run

sudo journalctl --pager-end --follow 

in another SSH session.
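If the search daemon runs as a systemd unit named manticore (an assumption here; check the actual unit name on the VM), you can narrow the log to just that service:

sudo journalctl --unit manticore --pager-end --follow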

IsaaacD commented 4 years ago

Thanks @Yorwba! I think that's the missing piece. I navigated to "Browse by language" and the sentences were there. I'm running the commands you said would index the sentences, and it's currently running.

This helps me get my environment set up, but I wonder if someone can comment on the script failing to load the data, given that each LOAD DATA INFILE statement works when I run it separately?

I put some SELECT statements into the script and it runs, but the only thing I see in the terminal is "First Part" being selected twice; no further SELECTs show up.

[screenshot: terminal output from the import script]
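(One way to see where it stalls, assuming the stock mysql client: re-run the script with the client echoing each statement and printing any warnings.)

sudo mysql --verbose --show-warnings tatoeba < Tatoeba/docs/database/import/restore_dumps.sql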

AndiPersti commented 4 years ago

Are you trying to load the database dumps into the VM with the default 2 GB (or less) of memory allocated?

Even if you could get the script to run to the end (I tried that a few weeks ago and, as far as I can remember, it finished after a very long time, i.e. several hours), you won't be able to index all the sentences because you won't have enough memory available.

For most of your local development, having the full database isn't really necessary. As Yorwba already suggested, just add some sentences, translations, comments, wall messages, etc. manually.

I agree that it would be easier for beginners to have a small sample database available. I can think of two possibilities: either we create this small database and add it to the VM, or we provide a seeding script that fills the database with sample data.
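A seeding script wouldn't have to be big. As a hypothetical sketch (the (id, lang, text) column list is taken from the LOAD DATA statement above; the rows and everything else are illustrative):

sudo mysql tatoeba <<'SQL'
INSERT INTO sentences (id, lang, text) VALUES
    (1, 'eng', 'This is a sample sentence.'),
    (2, 'fra', 'Ceci est une phrase d''exemple.');
SQL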

And implementing https://github.com/Tatoeba/imouto/issues/51 would also help.

IsaaacD commented 4 years ago

Thanks @AndiPersti, yes, I'm running this on the 2 GB VM. It has already indexed 8 of the 10 languages with the most sentences and has only used 370 MB. I'll let it run and see what happens; I'm sure paging should also help with any overflow, no?

As far as including a subset of the data, depending on how the CSVs are created, it'd be possible to run shuf -n <number of lines wanted> XXX.csv on each CSV to create a smaller subset of random sentences.
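For example (a sketch only: it assumes the dumps are the tab-separated files in /tmp used above, and note that sampling each table independently would break cross-table references such as translation links):

for f in /tmp/*.csv; do
    shuf -n 10000 "$f" > "${f%.csv}.sample.csv"
done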

That being said, it's something we developers could do ourselves to set up our environments.

trang commented 4 years ago

I agree that it would be easier for beginners to have a small sample database available.

+1

Note that with the test fixtures we actually already have the data for this small sample database. We're just missing the seeding script.
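(If the CakePHP Migrations plugin's seeders were used for this, running the script might look like the line below; the seeder class name is hypothetical.)

bin/cake migrations seed --seed SampleDataSeed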

jiru commented 4 years ago

Old related/duplicate issue: #452 and rotting PR #994.