The first step can be done by simply adding a frontend option box to normalize the geographic count result.
shengjix@uci.edu
@HotLemonJuice
I'm writing the crawler to extract the population data from https://www.citypopulation.de/USA-Cities.html. Here is the state population data I crawled, in JSON format:
[
{"state_population": "4,863,300", "state_name": "Alabama", "state_abbr": "AL"},
{"state_population": "741,894", "state_name": "Alaska", "state_abbr": "AK"},
{"state_population": "6,931,071", "state_name": "Arizona", "state_abbr": "AZ"},
{"state_population": "2,988,248", "state_name": "Arkansas", "state_abbr": "AR"},
...
]
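For reference, here is a rough sketch of the crawling approach (not the exact script). It assumes the page exposes its figures in static HTML tables; the table index and column names are illustrative and would need to be adapted to the actual page layout.

```python
# Rough sketch only -- assumes the page serves its figures in static HTML
# tables; the table index and column names are illustrative.
import json
import pandas as pd

URL = "https://www.citypopulation.de/USA-Cities.html"

tables = pd.read_html(URL)   # one DataFrame per <table> on the page
states = tables[0]           # pick the table holding the state figures

records = [
    {
        "state_name": str(row["Name"]),             # illustrative column name
        "state_abbr": str(row["Abbr."]),            # illustrative column name
        "state_population": str(row["Population"])  # illustrative column name
    }
    for _, row in states.iterrows()
]

with open("state_population.json", "w") as f:
    json.dump(records, f, indent=2)
```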
The population data is the latest the website provides. @JavierJia How does it look?
pretty nice! can you also crawl the city populations?
I added @luochen01; it is a use case of the new lookup statement. #295
And @luochen01 can also help Shengjie store the population JSON in AsterixDB. You two should communicate closely :-)
OK, now I have crawled the city and county populations as well.
[
{"city_name": "Abilene", "city_population": "121,721", "city_admin": "TX"},
{"city_name": "Akron", "city_population": "197,542", "city_admin": "OH"},
{"city_name": "Albuquerque", "city_population": "559,121", "city_admin": "NM"},
...
]
[
{"county_population": "55,416", "county_admin": "Alabama", "county_name": "Autauga"},
{"county_population": "208,563", "county_admin": "Alabama", "county_name": "Baldwin"},
{"county_population": "25,965", "county_admin": "Alabama", "county_name": "Barbour"},
{"county_population": "22,643", "county_admin": "Alabama", "county_name": "Bibb"},
{"county_population": "57,704", "county_admin": "Alabama", "county_name": "Blount"},
...
]
The state and county population data are complete, but the city population data only covers the following:
The population of all cities, towns and unincorporated places in the United States of America with more than 100,000 inhabitants according to census results and latest official estimates.
So there might be some cities missing from the data. I could do better by getting all the city data, though; I'll do it tomorrow morning.
@JavierJia Do I need to include the crawling scripts in the scripts folder and commit them?
nice! you don't need to include the crawling scripts in the GitHub repo. You can put them in our Google Drive as a record and add the link here in the issue :-)
I have finished crawling all the population data (state/county/city). I have included the output JSON files and Python code in the shared Google Drive folder: https://drive.google.com/open?id=0B7pIQ7oDAr4UYVRuR0h2cnRIX0E
Cool! Can you follow AsterixDB's documentation, design the data schema, and ingest those JSON files into AsterixDB (using your local Docker instance)?
I think you may need to do some cleaning to make the state/county/city names or IDs consistent with what we are using in cloudberry (under neo/public/data/*.json).
After that, you can use the latest lookup-inside-groupby request in the front-end to get the joined result containing stateId, aggregate_count, and population.
@HotLemonJuice I think you need to pre-process the state/county/city data to be consistent with our current dataset (we use IDs for state/county/city). After that, I can work with you to ingest those data into AsterixDB and figure out some queries to do the normalization. Maybe we can set up some time, like tomorrow afternoon, if the data can be pre-processed before then.
I have cleaned the state_population.json. Just a quick update on the output:
[
{
"stateID": 1,
"state_population": "4863300",
"state_name": "Alabama",
"state_abbr": "AL"
},
{
"stateID": 2,
"state_population": "741894",
"state_name": "Alaska",
"state_abbr": "AK"
},
{
"stateID": 4,
"state_population": "6931071",
"state_name": "Arizona",
"state_abbr": "AZ"
},
...
]
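For reference, the state-level ID assignment is essentially a name join against the project's existing state list. Here is a minimal sketch, assuming neo/public/data/state.json is a flat array of objects with "name" and "stateID" fields (the real file layout may differ):

```python
# Minimal sketch of the stateID assignment. Assumes state.json is a flat
# JSON array of objects with "name" and "stateID" fields; the real file
# under neo/public/data/ may be structured differently.
import json

with open("neo/public/data/state.json") as f:
    states = json.load(f)
name_to_id = {s["name"].strip().lower(): s["stateID"] for s in states}

with open("state_population.json") as f:
    crawled = json.load(f)

for rec in crawled:
    rec["stateID"] = name_to_id.get(rec["state_name"].strip().lower())
    # drop the thousands separators while we are at it
    rec["state_population"] = rec["state_population"].replace(",", "")

with open("state_population_cleaned.json", "w") as f:
    json.dump(crawled, f, indent=2)
```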
I have worked out a mostly cleaned version of the population data, but there are many special cases that leave some of the records with null IDs.
For example, there is a county called Oglala Lakota, but it was called Shannon before 2015. So county.json stores Shannon, while the population data I crawled stores Oglala Lakota. There is no way to know that "Shannon" is the same as "Oglala Lakota" without Googling it, so it ends up with a null ID...
Another nasty example is that "Baltimore City" and "Baltimore" are both counties of the state of Maryland, but in county.json they are stored with the same name, "Baltimore", differing only in countyID and geoID. This causes confusion when I try to assign county IDs to "Baltimore City" and "Baltimore".
Another common case is that some counties/cities aren't in county.json/city.json at all.
Those special cases are hard to resolve automatically, and it might take a long time for me to correct them manually (city.json must have even more mismatches!).
How should I handle this?
After using a new technique and some manual editing, I have cleaned all the data. Now all the cities/counties/states in the original JSON files have their associated population data. The new technique is to use my crawled population data as a lookup source instead of the original data in our project. I can go into the details... In short, with this approach only 1 record at the state level, 10 at the county level, and about 40 at the city level needed to be edited manually (e.g., by Googling them). The population data is now in the shared Google folder under data-collection/population, and I have included the scripts as well. https://drive.google.com/open?id=0B7pIQ7oDAr4UYVRuR0h2cnRIX0E
nice! mind briefly explaining the "new technique"?
Okay. Basically, I used Python to construct a nested dictionary from the population data I crawled and use it to look up and update the population field while looping through each record in the *.json files. For example, at the city level, the dictionary looks like:
{
cityA: {
countyA: {
stateA: {
'population': ...,
'duplicate': ...
},
stateB: {
'population': ...,
'duplicate': ...
},...
},
countyB: {
stateF: {
'population': ...,
'duplicate': ...
},
stateG: {
'population': ...,
'duplicate': ...
},...
},...
},
...
}
Then, when I loop through the data in the *.json files, I look up each city record's population in the dictionary by matching all three keys: city name, county name, and state name. If all three keys match, the population is set; otherwise, it is set to null temporarily. I also used the duplicate field to keep track of special cases where two records share the same city, county, and state names and mark them for manual editing (luckily there are very few of them). Fortunately, the mismatch rate turned out to be very low, so in the end I only had to edit about 50 records manually.
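Here is a minimal sketch of that lookup (field names are illustrative; the actual scripts are in the shared Google Drive folder):

```python
# Minimal sketch of the nested-dictionary lookup described above.
# Field names are illustrative; the real scripts live in the shared Drive.
import json

def key(s):
    return s.strip().lower()

# 1. Build city -> county -> state -> {"population", "duplicate"} from the crawled data.
lookup = {}
with open("city_population.json") as f:
    for rec in json.load(f):
        entry = (lookup.setdefault(key(rec["city_name"]), {})
                       .setdefault(key(rec["county_name"]), {}))
        state = key(rec["state_name"])
        if state in entry:
            entry[state]["duplicate"] = True   # same three keys seen twice -> manual check
        else:
            entry[state] = {"population": rec["city_population"],
                            "duplicate": False}

# 2. Walk the project's city.json and attach a population only when all
#    three keys (city, county, state) match.
with open("city.json") as f:
    cities = json.load(f)

for c in cities:
    hit = (lookup.get(key(c["cityName"]), {})
                 .get(key(c["countyName"]), {})
                 .get(key(c["stateName"])))
    if hit and not hit["duplicate"]:
        c["population"] = int(hit["population"].replace(",", ""))
    else:
        c["population"] = None   # missing or ambiguous -> edit manually later

with open("city_population_cleaned.json", "w") as f:
    json.dump(cities, f, indent=2)
```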
Cool. Job well done, @HotLemonJuice !
@JavierJia What's a good place to store such data files? In general, where do we put data files like the city/county/state polygons?
If the data is directly used by the front-end, like city/county/state polygons, we store it in github.
If it's big, we store it on google drive.
For this population table, I think if it's not big, we can put it inside the noah/src/main/resources/ folder.
Sounds good. A general suggestion: keep such data sets and their logic separate from the middleware in terms of the module structure.
The largest file is the city population JSON, which is 3 MB. The other two files are much smaller: about 300 KB for the county data and 3 KB for the state data.
The population data are in our team's Google Drive under cloudberry-shared-workplace/data-collection/population. They are named state_population_cleaned_final.adm, county_population_cleaned_final.adm, and city_population_cleaned_final.adm, and they have been converted to the Asterix Data Model (ADM) format.
I suggest you use the import statement in AsterixDB to ingest the data. Check out the Import/Export Statements section in https://ci.apache.org/projects/asterixdb/sqlpp/manual.html for ingesting data from the local file system.
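For example, a rough sketch of loading one of the .adm files through AsterixDB's HTTP query service (default port 19002). The dataverse/dataset names and the file path below are illustrative; the actual statements are the ones checked into the sqlpp folder.

```python
# Rough sketch: send a SQL++ LOAD statement to AsterixDB's HTTP query
# service. Dataverse/dataset names and the file path are illustrative.
import requests

STATEMENT = """
USE population;
LOAD DATASET StatePopulation USING localfs
    (("path"="127.0.0.1:///data/state_population_cleaned_final.adm"),
     ("format"="adm"));
"""

resp = requests.post("http://localhost:19002/query/service",
                     data={"statement": STATEMENT})
resp.raise_for_status()
print(resp.json()["status"])   # "success" when the load completes
```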
I have included the ingestion scripts in noah/src/main/resources/population/sqlpp
Here is the UI design for the Normalize option. It only shows after the user sends a query, and it can toggle normalization on and off. What do you guys think of the design?
LGTM!
A sneak peek of normalization
cool. are you using the lookup request now? that will be great.
for the UI, I think the 10^-x may be too small?
Yes, I'm using the lookup request. I used the superscripts to give the legend an elegant look, but I have to admit they are a bit small. What workaround would you suggest, @JavierJia? Increase the font size? Make them bold? Or use e-x to represent 10^-x instead?
By multiplying the number by 1000 x 1000 (one million), we can claim it is the average count per million users.
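For illustration only (the numbers below are made up): multiplying the per-person rate by one million turns an awkward 10^-x value into a readable per-million figure.

```python
# Illustration only -- the numbers are made up.
count, population = 152, 4_863_300   # e.g. tweet count vs. a state's population
rate = count / population            # ~3.1e-5, awkward to read on a legend
per_million = rate * 1_000_000       # ~31 tweets per million people
print(round(per_million, 1))
```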
btw, the white color should not be shown in the legend, just like in the original UI.
I chose to show the white color because in this case numbers smaller than 10^-7 are set to white. So is it okay not to indicate this at all?
In the previous UI, white means 0, and there is no need to show it on the legend. I think we can reduce the number of count levels in the normalization UI to be consistent with the organic-count one.
what about now?
it's consistent with the count one. it's better IMHO.
I agree with @JavierJia. In general the number of tweets per million is supposedly not too big.
The implementation of normalization is mostly done. Now I'm fixing a bug and re-crawling and cleaning the population data. After that, I will submit a PR.
Side note: I found a bug today which resulted in approximately 10K records missing at the city level and 94 missing at the county level... So far I've been able to reduce the number of missing records to 881 at the city level and 78 at the county level.
sounds good. so it's not a bug in cloudberry, right? :-)
rest assured. It's not from cloudberry :)
Looking forward to the full integration with the middleware!
The colors for the normalized results are similar because the values are statistically close to each other. So we need to use a dynamic scale for the legend in order to display more varied colors. One way to do it is to use (max - min) / (# of legends) to divide the scale.
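A minimal sketch of that idea (the bucket count and sample values are illustrative):

```python
# Minimal sketch of the dynamic legend scale: split [min, max] of the
# normalized values into equal-width buckets, one per legend color.
def legend_breakpoints(values, num_legends=6):
    lo, hi = min(values), max(values)
    step = (hi - lo) / num_legends
    return [lo + i * step for i in range(num_legends + 1)]

# e.g. per-million rates for a handful of states (illustrative numbers)
print(legend_breakpoints([3.1, 7.8, 12.4, 18.0, 25.6, 31.2]))
```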
@HotLemonJuice please try the new formula and see how it looks.
... by the number of users or by population.
It could be implemented by using the lookup query.