ISG-ICS / cloudberry

Big Data Visualization
http://cloudberry.ics.uci.edu

Add an option to show the normalized count #156

Closed. JavierJia closed this issue 7 years ago.

JavierJia commented 8 years ago

... by the number of users or by population.

It could be implemented by using the lookup query.

JavierJia commented 7 years ago

The first step can be done by simply adding a frontend option box to normalize the geographic count result.

shengjix@uci.edu

JavierJia commented 7 years ago

@HotLemonJuice

ShengjieXu commented 7 years ago

I'm writing a crawler to extract the population data from https://www.citypopulation.de/USA-Cities.html. Here is the state population data I crawled, in JSON format:

[
{"state_population": "4,863,300", "state_name": "Alabama", "state_abbr": "AL"},
{"state_population": "741,894", "state_name": "Alaska", "state_abbr": "AK"},
{"state_population": "6,931,071", "state_name": "Arizona", "state_abbr": "AZ"},
{"state_population": "2,988,248", "state_name": "Arkansas", "state_abbr": "AR"},
...
]

This is the latest population data the website provides. @JavierJia How does it look?
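For illustration, here is a minimal sketch of this kind of crawler using requests and BeautifulSoup. The table selector and cell order are assumptions for the sketch, not necessarily what my actual script uses:

import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.citypopulation.de/USA-Cities.html"

def crawl_state_populations(url=URL):
    # Fetch the page and parse the population table.
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    records = []
    # Assumed row layout: state name, abbreviation, latest population estimate.
    for row in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 3:
            records.append({
                "state_population": cells[2],
                "state_name": cells[0],
                "state_abbr": cells[1],
            })
    return records

if __name__ == "__main__":
    print(json.dumps(crawl_state_populations(), indent=2))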

JavierJia commented 7 years ago

pretty nice! can you also crawl the city populations?

JavierJia commented 7 years ago

I added @luochen01; this is a use case for the new lookup statement (#295). @luochen01 can also help Shengjie store the population JSON in AsterixDB. Keep in close communication, you two :-)

ShengjieXu commented 7 years ago

OK, now I have crawled city and county population as well.

city population

[
{"city_name": "Abilene", "city_population": "121,721", "city_admin": "TX"},
{"city_name": "Akron", "city_population": "197,542", "city_admin": "OH"},
{"city_name": "Albuquerque", "city_population": "559,121", "city_admin": "NM"},
...
]

county population

[
{"county_population": "55,416", "county_admin": "Alabama", "county_name": "Autauga"},
{"county_population": "208,563", "county_admin": "Alabama", "county_name": "Baldwin"},
{"county_population": "25,965", "county_admin": "Alabama", "county_name": "Barbour"},
{"county_population": "22,643", "county_admin": "Alabama", "county_name": "Bibb"},
{"county_population": "57,704", "county_admin": "Alabama", "county_name": "Blount"},
...
]

The state and county population data are complete, but the city population data only covers:

The population of all cities, towns and unincorporated places in the United States of America with more than 100,000 inhabitants according to census results and latest official estimates.

So some cities might be missing from the data. I can do better by crawling all of the city data; I'll do it tomorrow morning.

@JavierJia Should I include the crawling scripts in the scripts folder and commit them?

JavierJia commented 7 years ago

nice! you don't need to include the crawling scripts in GitHub. you can put them in one of our Google Drive folders as a record and add the link here in the issue :-)

ShengjieXu commented 7 years ago

I have finished crawling all the population data (state, county, and all cities). I have included the output JSON files and Python code in the shared Google Drive folder: https://drive.google.com/open?id=0B7pIQ7oDAr4UYVRuR0h2cnRIX0E

JavierJia commented 7 years ago

Cool! Can you follow AsterixDB's documentation, design the data schema, and ingest those JSON files into AsterixDB (using your local Docker instance)?

I think you may need to do some cleaning to make the state/county/city names or IDs consistent with what we are using in cloudberry (under neo/public/data/*.json).

After that, you can use the latest lookup-inside-groupby request in the front-end to get the joined result containing stateId, aggregate_count, and population.

luochen01 commented 7 years ago

@HotLemonJuice I think you need to pre-process the state/county/city data to be consistent with our current dataset (we use IDs for state/county/city). After that, I can help you ingest the data into AsterixDB and work out some queries to do the normalization. Maybe we can set up some time, like tomorrow afternoon, if the data can be pre-processed by then.

ShengjieXu commented 7 years ago

I have cleaned state_population.json. Here is a quick update on the output:

[
    {
        "stateID": 1,
        "state_population": "4863300",
        "state_name": "Alabama",
        "state_abbr": "AL"
    },
    {
        "stateID": 2,
        "state_population": "741894",
        "state_name": "Alaska",
        "state_abbr": "AK"
    },
    {
        "stateID": 4,
        "state_population": "6931071",
        "state_name": "Arizona",
        "state_abbr": "AZ"
    },
...
]

I have worked out a mostly cleaned version of the population data, but there are many special cases that leave some records with null IDs. For example, the county now called Oglala Lakota was named Shannon before 2015, so county.json stores "Shannon" while my crawled population data stores "Oglala Lakota"; there is no way to know the two are the same without Googling it, so the record ends up with a null ID. Another nasty example: "Baltimore City" and "Baltimore" are both counties of Maryland, but county.json stores both under the same name, "Baltimore", distinguished only by countyID and geoID, which makes it ambiguous when I try to assign county IDs to them. More commonly, some counties/cities aren't in county.json/city.json at all. These special cases are hard to resolve automatically, and correcting them manually could take a long time (city.json must have even more mismatches!). How should I handle this?

ShengjieXu commented 7 years ago

After using a new technique and some manual editing, I have cleaned all the data: every city/county/state in the original JSON files now has its associated population data. The new technique is to use my crawled population data as the lookup source instead of the original data in our project; I can go into the details. In short, after applying it, only 1 record at the state level, 10 at the county level, and about 40 at the city level needed manual editing (e.g., Googling them). The population data is now in the shared Google folder under data-collection/population, along with the scripts: https://drive.google.com/open?id=0B7pIQ7oDAr4UYVRuR0h2cnRIX0E

JavierJia commented 7 years ago

nice! mind briefly explaining the "new technique"?

ShengjieXu commented 7 years ago

Okay. Basically, I used Python to construct a nested dictionary from the population data I crawled, and I use it to look up and update the population field while looping through each record in the *.json files. For example, at the city level, the dictionary looks like:

{
  cityA: {
    countyA: {
      stateA: {
        'population': ..., 
        'duplicate': ...
      },
      stateB: {
        'population': ..., 
        'duplicate': ...
      },...
    },
    countyB: {
      stateF: {
        'population': ..., 
        'duplicate': ...
      },
      stateG: {
        'population': ..., 
        'duplicate': ...
      },...
    },...
  },
  ...
}

Then, when I loop through the data in *.json, for a given city record I look up its population in the dictionary by matching all three keys: city name, county name, and state name. If all three keys match, the population is set; otherwise it is temporarily set to null. I also use the duplicate field to track special cases where distinct cities share the same city, county, and state names, and mark them for manual editing (luckily there are very few of them). Fortunately, the mismatch rate turned out to be very low, so in the end I only had to manually edit about 50 records.
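A condensed sketch of the idea in Python (the field names here are illustrative placeholders, not the exact keys in our JSON files):

def build_lookup(crawled):
    # Nested dict: city -> county -> state -> {population, duplicate}.
    lookup = {}
    for rec in crawled:
        county_map = lookup.setdefault(rec["city"], {})
        state_map = county_map.setdefault(rec["county"], {})
        if rec["state"] in state_map:
            # Two crawled records share city/county/state: mark for manual editing.
            state_map[rec["state"]]["duplicate"] = True
        else:
            state_map[rec["state"]] = {"population": rec["population"],
                                       "duplicate": False}
    return lookup

def annotate(city_records, lookup):
    # Set the population on each city.json record, or null on a mismatch.
    for rec in city_records:
        entry = (lookup.get(rec["city"], {})
                       .get(rec["county"], {})
                       .get(rec["state"]))
        matched = entry is not None and not entry["duplicate"]
        rec["population"] = int(entry["population"].replace(",", "")) if matched else None
    return city_records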

chenlica commented 7 years ago

Cool. Job well done, @HotLemonJuice !

@JavierJia What's a good place to store such data files? In general, where do we put these data files (e.g., city/county/state polygons)?

JavierJia commented 7 years ago

If the data is directly used by the front-end, like the city/county/state polygons, we store it in GitHub. If it's big, we store it on Google Drive. For this population table, if it's not big, I think we can put it inside the noah/src/main/resources/ folder.

chenlica commented 7 years ago

Sounds good. A general suggestion: keep such data sets and their logic separate from the middleware in terms of module structure.

ShengjieXu commented 7 years ago

The largest file is the city population JSON, at 3MB. The other two files are much smaller: about 300KB for county and 3KB for state.

ShengjieXu commented 7 years ago

The population data is in our team's Google Drive under cloudberry-shared-workplace/data-collection/population, in files named state_population_cleaned_final.adm, county_population_cleaned_final.adm, and city_population_cleaned_final.adm. They have been converted to the Asterix Data Model (ADM) format. I suggest using AsterixDB's load statement to ingest them; check out the Import/Export Statements section in https://ci.apache.org/projects/asterixdb/sqlpp/manual.html for ingesting data from the local file system.

I have included the ingestion scripts in noah/src/main/resources/population/sqlpp.
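For reference, a minimal sketch of driving such a load from Python through AsterixDB's HTTP query service (default port 19002). The dataverse name, dataset name, and file path below are placeholders; the real statements are in the sqlpp folder above:

import requests

ASTERIX = "http://localhost:19002/query/service"

def run(statement):
    # Submit one or more SQL++ statements to the query service.
    resp = requests.post(ASTERIX, data={"statement": statement})
    resp.raise_for_status()
    return resp.json()

# LOAD reads from the node's local filesystem, so the .adm file must be
# visible inside the Docker container, and the dataverse and dataset must
# already exist.
run("""
    USE berry;
    LOAD DATASET state_population USING localfs
        (("path"="localhost:///data/state_population_cleaned_final.adm"),
         ("format"="adm"));
""")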

ShengjieXu commented 7 years ago

Here is the UI design for the Normalize option: [screenshot: Normalize option UI, 2017-04-11]. It only shows after the user sends a query, and it toggles normalization on and off. What do you guys think of the design?

chenlica commented 7 years ago

LGTM!

ShengjieXu commented 7 years ago

A sneak peek of normalization: [video: normalize1.mov]

JavierJia commented 7 years ago

cool. are you using the lookup request now? that would be great. for the UI, I think the 10^-x labels may be too small?

ShengjieXu commented 7 years ago

Yes, I'm using the lookup request. I used superscripts to give the legend an elegant look, but I admit they are a bit small. What workaround would you suggest, @JavierJia? Increase the font size? Make them bold? Or use e-x to represent 10^-x instead?

JavierJia commented 7 years ago

multiplying the number by 1000 x 1000, we can present it as the average count per million users. btw, the white color should not be shown in the legend, just like in the original UI.
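roughly this (a sketch; the function name is made up):

def per_million(count, population):
    # Scale the raw ratio by 1000 x 1000 so the legend shows counts per million.
    return count / population * 1_000_000

# e.g. 1,200 tweets from a state of 4,863,300 people:
print(per_million(1200, 4863300))  # ~246.75 tweets per million residents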

ShengjieXu commented 7 years ago

I chose to show the white color because, in this case, numbers smaller than 10^-7 are set to white. So is it okay not to indicate this at all?

JavierJia commented 7 years ago

In the previous UI, white means 0, so there is no need to show it in the legend. I think we can also reduce the number of count levels in the normalization UI to be consistent with the organic-count one.

ShengjieXu commented 7 years ago

what about now? [screenshot: updated legend, 2017-04-16]

JavierJia commented 7 years ago

it's consistent with the count one. it's better IMHO.

chenlica commented 7 years ago

I agree with @JavierJia. In general, the number of tweets per million should not be too big.

ShengjieXu commented 7 years ago

The implementation of normalization is mostly done. Now I'm fixing a bug and re-crawling and cleaning the population data. After that, I will submit a PR.

side note: I found a bug today that resulted in approximately 10K records missing at the city level and 94 at the county level... So far I've been able to reduce the missing records to 881 at the city level and 78 at the county level.

JavierJia commented 7 years ago

sounds good. so it's not a bug in cloudberry, right? :-)

ShengjieXu commented 7 years ago

Rest assured, it's not from cloudberry :)

chenlica commented 7 years ago

Looking forward to the full integration with the middleware!

ShengjieXu commented 7 years ago

The colors of the normalized results are similar because the values are statistically close to each other, so we need a dynamic scale for the legend in order to display more varied colors. One way to do it is to divide the scale into equal-width buckets of size (max - min) / (number of legend levels).
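A sketch of that formula (the number of levels is a parameter we can tune):

def legend_breaks(values, num_levels):
    # Divide [min, max] into equal-width buckets so nearby normalized
    # values still map to distinct colors.
    lo, hi = min(values), max(values)
    step = (hi - lo) / num_levels
    # Upper bound of each bucket.
    return [lo + step * (i + 1) for i in range(num_levels)]

print(legend_breaks([12.5, 40.2, 91.0, 247.8], 4))
# -> [71.325, 130.15, 188.975, 247.8]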

chenlica commented 7 years ago

@HotLemonJuice please try the new formula and see how it looks.