_Closed by cw00dw0rd 3 years ago_
There's in fact a newer Berlin dataset from 2020-12-21: http://insideairbnb.com/get-the-data.html It even comes with a neighborhood GeoJSON file!

Using the Berlin, Germany dataset from December 21, 2020.
### listings_summary.csv

- Renamed `id` to `_key`
- Counted in how many documents each attribute occurs:

```aql
FOR doc IN listings_summary
  FOR a IN ATTRIBUTES(doc, true)
    COLLECT attr = a WITH COUNT INTO count
    RETURN {attr, count}
```
Most attributes occur in all 20,224 documents. Only `host_name`, `last_review`, and `reviews_per_month` are absent in some of them.

Comparison against the `listings` collection (sample):
```aql
FOR ls IN listings_summary
  LIMIT 100
  FOR l IN listings
    FILTER l._key == ls._key
    LET diff = MERGE(FOR attr IN ATTRIBUTES(ls, true)
      FILTER ls[attr] != l[attr] RETURN {[attr]: [l[attr], ls[attr]]})
    RETURN diff
```
- `neighbourhood` is often missing or `"Berlin, Germany"` in `listings`. I suspect it's a free-text field that hosts can fill in; the clean data appears to be in `neighbourhood_cleansed`.
- `neighbourhood_group` is (always?) missing in `listings`. The data seems to be stored in `neighbourhood_group_cleansed` instead.
- `price` is a formatted string with a currency symbol in `listings`, but a number in `listings_summary`.

Renamed the *neighbourhood* attributes in `listings_summary` to American English spelling:

```aql
FOR doc IN listings_summary
  REPLACE MERGE(
    UNSET(doc, ["neighbourhood_group", "neighbourhood"]),
    { neighborhood_group: doc.neighbourhood_group, neighborhood: doc.neighbourhood }
  ) IN listings_summary
```
### listings.csv

- Renamed `id` to `_key`
- Converted `amenities` and `host_verifications` to arrays of strings. Mapped `"None"` to `[]` for the latter, but we might want to drop the entire attribute if it has no value, or give it an explicit `null` value.

```aql
FOR doc IN listings
  LET amenities = JSON_PARSE(doc.amenities)
  LET host_verifications = doc.host_verifications == "None"
    ? []
    : JSON_PARSE(SUBSTITUTE(doc.host_verifications, "'", '"'))
  UPDATE doc WITH { amenities, host_verifications } IN listings
```
- `description` contains HTML markup; seen the following (possibly not exhaustive): `<br />`, `<p></p>`, `<b></b>`, `<a href="..."></a>`, `<body></body>`, `<html></html>`, as well as stray characters and entities like `♣`, `<`, `&boch;`, `’`
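The markup in `description` was only documented, not cleaned. If stripping it is ever desired, something like the following could work (a sketch, not applied to the dataset; it assumes that dropping all angle-bracket tags wholesale is acceptable):

```aql
FOR doc IN listings
  FILTER doc.description != null
  UPDATE doc WITH {
    // replace every <...> tag with a space, then trim the result
    description: TRIM(REGEX_REPLACE(doc.description, "<[^>]+>", " "))
  } IN listings
```

This would not decode HTML entities like `&boch;`; those would need a separate `SUBSTITUTE()` pass.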
- `host_about` contains `\r\n` line breaks
- `price` is always a string with a leading `$` (so the same currency throughout). Converted it to a number to enable range queries, but it could also be interesting to solve this at the View/Analyzer level.

```aql
FOR doc IN listings UPDATE doc WITH { price: TO_NUMBER(SUBSTRING(doc.price, 1)) } IN listings
```
The following query finds attributes holding `"t"`/`"f"` strings:

```aql
FOR doc IN listings
  FOR attr IN ATTRIBUTES(doc, true)
    FILTER doc[attr] IN ["t","f"]
    RETURN DISTINCT attr
```

Updated them to boolean values:

```aql
LET attrs = [
  "host_has_profile_pic",   // nullable
  "has_availability",
  "host_is_superhost",      // nullable
  "host_identity_verified", // nullable
  "instant_bookable"
]
FOR doc IN listings
  LET updated = MERGE(FOR attr IN attrs
    RETURN {[attr]: doc[attr] == "t" ? true : doc[attr] == "f" ? false : null }
  )
  UPDATE doc WITH updated IN listings
```
- Added `location` with a GeoPoint constructed from the `longitude` and `latitude` attributes (in this order!). Left the original attributes in, just in case.

```aql
FOR doc IN listings UPDATE doc WITH { location: GEO_POINT(doc.longitude, doc.latitude) } IN listings
```

- Renamed the *neighbourhood* attributes to American English. `ZIP()` is used to avoid adding attributes with `null` values in case an attribute wasn't present in the first place.

```aql
FOR doc IN listings
  LET attrs = ATTRIBUTES(doc, true)[* FILTER CONTAINS(CURRENT, "hood")]
  LET newAttrs = attrs[* RETURN SUBSTITUTE(CURRENT, "bour", "bor")]
  REPLACE MERGE(UNSET(doc, attrs), ZIP(newAttrs, attrs[* RETURN doc[CURRENT]])) IN listings
```
### reviews_summary.csv

- Renamed `id` to `_key`
- Converted `listing_id` from number to string

```aql
FOR doc IN reviews_summary UPDATE doc WITH { listing_id: TO_STRING(doc.listing_id) } IN reviews_summary
```
The collection contains `listing_id` and `date` only. Removed it.

### reviews.csv

- Renamed `id` to `_key`
- Converted `listing_id` from number to string

```aql
FOR doc IN reviews UPDATE doc WITH { listing_id: TO_STRING(doc.listing_id) } IN reviews
```
### calendar.csv.gz

- Unpacked as `calendar.csv`
- Converted `listing_id` from number to string

```aql
FOR doc IN calendar UPDATE doc WITH { listing_id: TO_STRING(doc.listing_id) } IN calendar
```

TODO: Map `listing_id` to `_key`? It would save us an extra index, but make it less obvious which other collection this field refers to.

- Converted `available` from string to boolean

```aql
FOR doc IN calendar UPDATE doc WITH { available: doc.available == "t" ? true : false } IN calendar
```
- Converted `price` and `adjusted_price` from the `$0.00` format into numbers.

```aql
FOR doc IN calendar
  FILTER LEFT(doc.adjusted_price, 1) == "$"
  UPDATE doc WITH { adjusted_price: TO_NUMBER(SUBSTRING(doc.adjusted_price, 1)) } IN calendar
```
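Only the `adjusted_price` query is shown above; presumably `price` was converted with the analogous query (a sketch following the same pattern):

```aql
FOR doc IN calendar
  FILTER LEFT(doc.price, 1) == "$"
  UPDATE doc WITH { price: TO_NUMBER(SUBSTRING(doc.price, 1)) } IN calendar
```

Note that, as discovered later for `listings`, `TO_NUMBER()` returns `0` for strings containing a thousands separator, so a `SUBSTITUTE(doc.price, ",", "")` step may be needed here as well.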
### neighbourhoods.csv

Can potentially be transformed into a graph: `neighbourhood_group` <-- `neighbourhood`

```aql
FOR doc IN neighbourhoods
  COLLECT group = doc.neighbourhood_group INTO neighbourhood = doc.neighbourhood
  RETURN {group, neighbourhood}
```

- Removed an empty document that was created because the last line in the source file only has a stray comma
- Removed the collection, because the GeoJSON source has the same data under `properties`
### neighbourhoods.geojson

```aql
LET geo = DOCUMENT("neighbourhoods_geojson/7874966") // key may vary
FOR feat IN geo.features
  INSERT feat INTO neighbourhoods_geojson_features
```

Some features have a `null` value for `neighbourhood` and `neighbourhood_group`: two small, adjacent areas to the north-east of Berlin. A watermark? Removed them.

```aql
FOR f IN neighbourhoods_geojson_features
  FILTER null IN [f.properties.neighbourhood, f.properties.neighbourhood_group]
  RETURN f
```
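The query above only returns the offending features; removing them would look like this (a sketch, swapping `RETURN` for `REMOVE`):

```aql
FOR f IN neighbourhoods_geojson_features
  FILTER null IN [f.properties.neighbourhood, f.properties.neighbourhood_group]
  REMOVE f IN neighbourhoods_geojson_features
```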
Created one vertex per neighborhood group:

```aql
FOR doc IN neighbourhoods
  COLLECT group = doc.neighbourhood_group
  LET g = {type: "group", name: group}
  INSERT g INTO neighborhoods
```

Created one vertex per neighborhood, merged with its GeoJSON feature, plus an edge to its group:

```aql
FOR doc IN neighbourhoods
  COLLECT group = doc.neighbourhood_group INTO neighbourhoods = doc.neighbourhood
  LET group_id = FIRST(FOR g IN neighborhoods FILTER g.name == group LIMIT 1 RETURN g._id)
  FOR neighborhood IN neighbourhoods
    LET feat = (FOR f IN neighbourhoods_geojson_features
      FILTER group == f.properties.neighbourhood_group AND
             neighborhood == f.properties.neighbourhood
      RETURN KEEP(f, ATTRIBUTES(f, true)))
    FILTER WARN(LENGTH(feat) == 1, "expected only one matching feature")
    LET n = MERGE({type: "neighborhood", name: neighborhood}, FIRST(feat))
    LET neighborhood_id = FIRST(INSERT n INTO neighborhoods RETURN NEW._id)
    INSERT { _from: neighborhood_id, _to: group_id } INTO neighborhood_edges
```
Renamed the attributes under `properties` in `neighborhoods` (American English):

```aql
FOR doc IN neighborhoods
  UPDATE doc WITH {
    properties: {
      neighborhood: doc.properties.neighbourhood,
      neighborhood_group: doc.properties.neighbourhood_group
    }
  } IN neighborhoods OPTIONS { mergeObjects: false }
```
We should add indexes for the following fields (collections), aside from what we will do with ArangoSearch:

- `listing_id` (reviews, reviews_summary, calendar): persistent index
- `host_id` (listings, listings_summary): persistent index
- `latitude`, `longitude` (listings): geo index
- `amenities` (listings): persistent array index

Is the conversion of `ß` to `ss` with the German text Analyzer part of the accent removal?
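One way to check this empirically (a sketch; it assumes the built-in `text_de` Analyzer is the one we end up using):

```aql
// inspect how the German text Analyzer normalizes ß
RETURN TOKENS("Straße", "text_de")
```

If the returned token list contains `strasse`, the folding happens as part of the Analyzer's normalization; if it contains `straße`, it does not and would have to be handled separately.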
Notes about making the dataset available for others:

```sh
./build/bin/arangodump --server.endpoint tcp://127.0.0.1:9929 --server.database arangobnb --server.authentication false --threads 5 --collection listings --collection reviews --collection neighborhoods --collection calendar --collection _analyzers --include-system-collections dump_2021-03-12
./build/bin/arangorestore --server.endpoint tcp://127.0.0.1:9929 --server.database arangobnb_backup --create-database --server.authentication false --threads 5 --include-system-collections dump_2021-xx-xx
```

Dumping remotely (arangodump running locally, arangod remotely) seems to be really slow. Dumping on the same machine as the server process, on the other hand, only takes a minute.
We will need to change how the `latitude` and `longitude` values are being stored. They will need to be either GeoJSON, an array, or stored as sub-properties.

https://www.arangodb.com/docs/devel/arangosearch-analyzers.html#geojson

I think we have enough information to make descriptive GeoJSON objects for the listings:

```js
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    // GeoJSON mandates [longitude, latitude] order
    "coordinates": [listing.longitude, listing.latitude]
  },
  "properties": {
    "name": listing.name
  }
}
```
What do you think?
Good point, we should make it a GeoJSON Point, which moves the coordinates to a sub-attribute. Not sure if there's a benefit in moving the name to the GeoJSON `properties` field. Maybe outside of ArangoDB, when exported?

I don't know that there is a huge benefit to putting the name in `properties`. Perhaps if we were creating a separate collection of just GeoJSON objects, but otherwise I don't think it will make a difference.
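For reference, the earlier `GEO_POINT()` update already yields exactly this Point shape; an explicit, equivalent sketch without the helper function would be:

```aql
FOR doc IN listings
  UPDATE doc WITH {
    // GeoJSON Point: coordinates in [longitude, latitude] order
    location: { type: "Point", coordinates: [doc.longitude, doc.latitude] }
  } IN listings
```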
Found that prices above `999` were not being included: `TO_NUMBER()` doesn't consider a number with a `,` valid and thus turned `1,000` and up into `0`. Updated the query to strip the `,`, and now they are included:

```aql
FOR doc IN listings UPDATE doc WITH { price: TO_NUMBER(SUBSTRING(SUBSTITUTE(doc.price, ",", ""), 1)) } IN listings
```
The current dataset we intend to use is here: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data?select=listings_summary.csv

This may need to change if we find it isn't able to fulfill some of the features we require for the project. However, the initial investigation looks promising.

We will need to model the data, and once that is done we will update the README to contain the working dataset.

Feel free to contribute to this as well if you have a suggested/preferred approach.