ArangoDB-Community / ArangoBnB


AQL: Dataset Modeling #16

Closed cw00dw0rd closed 3 years ago

cw00dw0rd commented 3 years ago

The current dataset we intend to use is here https://www.kaggle.com/brittabettendorf/berlin-airbnb-data?select=listings_summary.csv

This may need to change if we find it doesn't support some of the features we require for the project; however, the initial investigation looks promising.

We will need to model the data, and once that is done, we will update the README to reference the working dataset.

Feel free to contribute to this as well if you have a suggested/preferred approach.

Simran-B commented 3 years ago

There's in fact a newer Berlin dataset from 2020-12-21 at http://insideairbnb.com/get-the-data.html, and it even comes with a neighbourhood GeoJSON file!

Simran-B commented 3 years ago

Import notes

Using the Berlin, Germany dataset from 21 December 2020

listings.csv (summary)

listings.csv.gz

reviews.csv (summary)

reviews.csv.gz

calendar.csv.gz

neighbourhoods.geojson
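
For reference, the CSV files can be imported with arangoimport along these lines (a sketch: the database name is taken from the dump commands further down, and the .gz archives are assumed to be extracted first; the GeoJSON file would additionally need its features array unwrapped into individual documents):

# Sketch: import the detailed listings into a new collection
./build/bin/arangoimport --server.database arangobnb --collection listings \
  --file listings.csv --type csv --create-collection true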

Indexing

Aside from what we will do with ArangoSearch, we should add indexes on a few fields in each collection; a possible setup is sketched below.

Is the conversion of ß to ss with the German text Analyzer part of the accent removal?
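
A sketch of what the setup could look like in arangosh, assuming the listings get a GeoJSON location attribute (as discussed further down) and assuming this particular Analyzer configuration (the name text_de_custom is hypothetical):

// Sketch: geo index over the GeoJSON location attribute of each listing
db.listings.ensureIndex({ type: "geo", fields: ["location"], geoJson: true });

// Sketch: a custom German text Analyzer for ArangoSearch; whether
// accent: false also folds ß into ss is the open question above
var analyzers = require("@arangodb/analyzers");
analyzers.save("text_de_custom", "text",
  { locale: "de.utf-8", case: "lower", accent: false, stemming: true },
  ["frequency", "norm", "position"]);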

Dump & Restore

Notes about making the dataset available for others.

./build/bin/arangodump --server.endpoint tcp://127.0.0.1:9929 \
  --server.database arangobnb --server.authentication false --threads 5 \
  --collection listings --collection reviews --collection neighborhoods \
  --collection calendar --collection _analyzers \
  --include-system-collections dump_2021-03-12

./build/bin/arangorestore --server.endpoint tcp://127.0.0.1:9929 \
  --server.database arangobnb_backup --create-database \
  --server.authentication false --threads 5 \
  --include-system-collections dump_2021-xx-xx

Dumping remotely (arangodump running locally against a remote arangod) seems to be really slow. Dumping on the same machine as the server process, on the other hand, only takes a minute.

cw00dw0rd commented 3 years ago

We will need to change how the latitude and longitude values are stored: they will need to be either GeoJSON, an array, or sub-properties. https://www.arangodb.com/docs/devel/arangosearch-analyzers.html#geojson

I think we have enough information to make descriptive GeoJSON objects for the listings (note that GeoJSON mandates [longitude, latitude] coordinate order):

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [listing.latitude, listing.longitude]
  },
  "properties": {
    "name": listing.name
  }
}
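
If we go with that shape, the conversion could be a one-off AQL update along these lines (a sketch; the target attribute name location is an assumption):

FOR doc IN listings
  UPDATE doc WITH {
    location: {
      type: "Feature",
      geometry: {
        type: "Point",
        // GeoJSON mandates [longitude, latitude] order
        coordinates: [TO_NUMBER(doc.longitude), TO_NUMBER(doc.latitude)]
      },
      properties: { name: doc.name }
    }
  } IN listings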

What do you think?

Simran-B commented 3 years ago

Good point, we should make it a GeoJSON Point, which moves the coordinates into a sub-attribute. Not sure if there's a benefit in moving the name into the GeoJSON properties field, though. Maybe outside of ArangoDB, when exported?

cw00dw0rd commented 3 years ago

I don't know that there is a huge benefit to putting the name in the properties. Perhaps if we were creating a separate collection of just GeoJSON objects, but otherwise I don't think it will make a difference.

cw00dw0rd commented 3 years ago

Added first dump: https://drive.google.com/drive/folders/1crMM2RRpdVgi7gkblAlAZXTvIoNNVYbT?usp=sharing

cw00dw0rd commented 3 years ago

Found that prices above 999 were not being included: TO_NUMBER doesn't consider a number containing a comma valid and thus turned 1,000+ into 0. Updated the query to strip the comma, and now they are included.

// Strip the leading "$" and the thousands separator before conversion
FOR doc IN listings
  UPDATE doc WITH {
    price: TO_NUMBER(SUBSTRING(SUBSTITUTE(doc.price, ",", ""), 1))
  } IN listings
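
As a quick sanity check (a sketch), the higher-priced listings should show up again after the update:

// Sketch: count listings priced above 999; should be non-zero now
RETURN LENGTH(
  FOR doc IN listings
    FILTER doc.price > 999
    RETURN 1
)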