Glacier-Ice / data-sci-api

An API
https://sturgeons.app
BSD 3-Clause "New" or "Revised" License

Fix error that causes 0 timestamp and "市" in city field #15

Closed · Vopaaz closed this 4 years ago

zhaofeng-shu33 commented 4 years ago

Can you explain in the README documentation how to run the data migration Python code? I am new to this project and a little confused.

Vopaaz commented 4 years ago

> Can you explain in the README documentation how to run the data migration Python code? I am new to this project and a little confused.

Added in the latest commit.

zhaofeng-shu33 commented 4 years ago

What is the purpose of using Python to put data into the database? We could just execute SQL data files to do this.

Vopaaz commented 4 years ago

> What is the purpose of using Python to put data into the database? We could just execute SQL data files to do this.

Because the data is dynamically updated. We pull data from Baidu's API each day, reshape it into the desired format, and store it in the DB.
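
For readers new to the pipeline, here is a minimal sketch of that kind of daily pull-and-reshape step, assuming the requests, pandas, and SQLAlchemy libraries; the endpoint URL, the JSON layout, the column names, and the connection string are placeholders rather than the project's actual ones.

```python
import requests
import pandas as pd
from sqlalchemy import create_engine

# Placeholder endpoint; the real crawler targets Baidu's migration API,
# whose exact URL and parameters are not reproduced here.
API_URL = "https://example.com/migration/history"

def pull_and_store(city_id: str, direction: str, engine) -> None:
    # Fetch one city's raw inflow or outflow history from the upstream API.
    resp = requests.get(API_URL, params={"id": city_id, "type": direction}, timeout=10)
    resp.raise_for_status()
    raw = resp.json()

    # Reshape the nested JSON into a flat table: one row per (city, date, value).
    df = pd.DataFrame(
        [
            {"city_id": city_id, "date": d, "direction": direction, "value": v}
            for d, v in raw["list"].items()
        ]
    )

    # Append the reshaped rows to the database.
    df.to_sql("migration_index", engine, if_exists="append", index=False)

if __name__ == "__main__":
    engine = create_engine("sqlite:///demo.db")  # placeholder connection string
    pull_and_store("110000", "in", engine)
```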

zhaofeng-shu33 commented 4 years ago
[feng@bcm point-to-point-migration]$ python integration.py 
Traceback (most recent call last):
  File "integration.py", line 101, in <module>
    res = get_p2p_overall_dataframe()
  File "integration.py", line 55, in get_p2p_overall_dataframe
    history_curve = load_history(date, row.adcode)
  File "integration.py", line 26, in load_history
    update_history_if_outdated("in", city_id)
  File "integration.py", line 18, in update_history_if_outdated
    with open(path, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './temp/move_in_history_110000.txt'

This is the first time I have run the migration code. Please help.

Vopaaz commented 4 years ago

Would you please try the latest version? @zhaofeng-shu33

Vopaaz commented 4 years ago

By the way, if you want to run the code for test purposes, please comment out the lines that actually dump data into the DB.

zhaofeng-shu33 commented 4 years ago

> Would you please try the latest version? @zhaofeng-shu33

I will try again.

zhaofeng-shu33 commented 4 years ago

Why do we sleep for 1 second on every request? https://github.com/Glacier-Ice/data-sci-api/blob/fix-index-data/src-etl/point-to-point-migration/crawl.py#L22

zhaofeng-shu33 commented 4 years ago

After commenting out the sleep call, running python integration.py produces a lot of txt files. How can I load the data from those txt files into the database?

Vopaaz commented 4 years ago

> Why do we sleep for 1 second on every request? https://github.com/Glacier-Ice/data-sci-api/blob/fix-index-data/src-etl/point-to-point-migration/crawl.py#L22

If you request the API too frequently, you will be blocked.
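
In other words, the sleep is a crude rate limiter. Here is a minimal sketch of that kind of throttled request loop, assuming the requests library; the function and variable names are illustrative and not taken from crawl.py.

```python
import time
import requests

def fetch_all(city_ids, base_url):
    """Fetch one record per city, pausing between requests so the
    upstream API does not block the client for hammering it."""
    results = {}
    for city_id in city_ids:
        resp = requests.get(base_url, params={"id": city_id}, timeout=10)
        resp.raise_for_status()
        results[city_id] = resp.json()
        time.sleep(1)  # stay under the (unofficial) request-rate limit
    return results
```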

Vopaaz commented 4 years ago

> After commenting out the sleep call, running python integration.py produces a lot of txt files. How can I load the data from those txt files into the database?

https://github.com/Glacier-Ice/data-sci-api/blob/fix-index-data/src-etl/point-to-point-migration/main.py does this.
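
Roughly speaking, the loader walks the dumped files and inserts their contents. Here is a hypothetical sketch under the assumption that each temp file holds one city's {date: value} history as JSON with the direction and adcode encoded in the file name (the actual parsing and table layout in main.py may differ), using pandas and SQLAlchemy.

```python
import glob
import json
import re
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///demo.db")  # placeholder connection string

rows = []
for path in glob.glob("./temp/move_*_history_*.txt"):
    # Assumed file naming: move_<in|out>_history_<adcode>.txt
    match = re.search(r"move_(in|out)_history_(\d+)\.txt$", path)
    direction, city_id = match.group(1), match.group(2)
    with open(path, "r", encoding="utf-8") as f:
        history = json.load(f)
    for date, value in history.items():
        rows.append(
            {"city_id": city_id, "date": date, "direction": direction, "value": value}
        )

# Bulk-insert the collected rows; the table name and columns are illustrative.
pd.DataFrame(rows).to_sql("migration_index", engine, if_exists="append", index=False)
```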

zhaofeng-shu33 commented 4 years ago

In main.py I found that you load the database config from an environment variable. But why do you use json.loads? That actually parses a string, not a file.

Vopaaz commented 4 years ago

> In main.py I found that you load the database config from an environment variable. But why do you use json.loads? That actually parses a string, not a file.

In the production environment, the config is indeed stored directly in the environment variable, rather than as a path.

zhaofeng-shu33 commented 4 years ago

> In main.py I found that you load the database config from an environment variable. But why do you use json.loads? That actually parses a string, not a file.

> In the production environment, the config is indeed stored directly in the environment variable, rather than as a path.

So I have to store the JSON string itself in this environment variable, rather than a path to a JSON file?

Vopaaz commented 4 years ago

> In main.py I found that you load the database config from an environment variable. But why do you use json.loads? That actually parses a string, not a file.

> In the production environment, the config is indeed stored directly in the environment variable, rather than as a path.

> So I have to store the JSON string itself in this environment variable, rather than a path to a JSON file?

That's your choice. Personally, I have a local branch that uses json.load and sets CONFIG_PATH to the config file path. After development, I cherry-pick the necessary commits onto the remote-tracking branch and push.
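
To make the difference concrete, here is a small sketch of the two loading styles; CONFIG_PATH follows the thread above, while DB_CONFIG is only a placeholder name and the config keys are examples.

```python
import json
import os

# Production style: the JSON document itself is stored in the environment
# variable, so json.loads parses the string directly.
#   export DB_CONFIG='{"host": "localhost", "user": "root", "password": "..."}'
config = json.loads(os.environ["DB_CONFIG"])

# Local-development style (the json.load branch mentioned above): the
# environment variable holds a path to a JSON file instead.
#   export CONFIG_PATH=./config.json
with open(os.environ["CONFIG_PATH"], "r", encoding="utf-8") as f:
    config = json.load(f)
```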

zhaofeng-shu33 commented 4 years ago

That's cool

zhaofeng-shu33 commented 4 years ago

I find that the script in main.py creates two tables, "migration_index" and "p2p_migration". What is the purpose of these two tables? Are they used by the API server?

Stockard commented 4 years ago

Hi @zhaofeng-shu33. Very good question. Apparently we need a guideline introducing the tables. p2p_migration holds the migration index from city to city (which is what p2p stands for); it starts from late January 2020, because the API only goes back 30 days. migration_index stores the outflow and inflow index for a single city, and could ideally date back to 2019 once fixed. I bet someone may be interested in comparing the migration index from year to year to measure this year's drop. http://qianxi.baidu.com/ may give you some intuition.
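
Purely as an illustration of that distinction (the actual columns created by main.py may differ), here is a sketch of how the two tables could be declared with SQLAlchemy Core; every column name here is a guess.

```python
from sqlalchemy import Column, Date, Float, Integer, MetaData, String, Table

metadata = MetaData()

# City-to-city migration index, available only from late January 2020 onward.
p2p_migration = Table(
    "p2p_migration", metadata,
    Column("id", Integer, primary_key=True),
    Column("date", Date),
    Column("from_city", String(6)),  # adcode of the origin city
    Column("to_city", String(6)),    # adcode of the destination city
    Column("value", Float),          # migration index between the two cities
)

# Per-city inflow/outflow index, which could ideally go back to 2019.
migration_index = Table(
    "migration_index", metadata,
    Column("id", Integer, primary_key=True),
    Column("date", Date),
    Column("city_id", String(6)),    # adcode of the city
    Column("direction", String(3)),  # "in" or "out"
    Column("value", Float),
)
```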

Would either of you write a short manual so that others can avoid the config-setup trouble again? Thanks a lot.

zhaofeng-shu33 commented 4 years ago

I have added some documentation.