_Closed by cw00dw0rd 3 years ago_
There's in fact a newer Berlin dataset from 2020-12-21: http://insideairbnb.com/get-the-data.html It even comes with a neighborhood GeoJSON file!

Using the Berlin, Germany dataset from December 21, 2020.
### listings_summary.csv

- Renamed `id` to `_key`
- Counted in how many documents each attribute occurs:

```aql
FOR doc IN listings_summary
  FOR a IN ATTRIBUTES(doc, true)
    COLLECT attr = a WITH COUNT INTO count
    RETURN {attr, count}
```
Most attributes occur in all 20,224 documents. Only `host_name`, `last_review`, and `reviews_per_month` are absent in some of them.

Comparison against the `listings` collection (sample):
```aql
FOR ls IN listings_summary
  LIMIT 100
  FOR l IN listings
    FILTER l._key == ls._key
    LET diff = MERGE(FOR attr IN ATTRIBUTES(ls, true)
      FILTER ls[attr] != l[attr] RETURN {[attr]: [l[attr], ls[attr]]})
    RETURN diff
```
- `neighbourhood` is often missing or `"Berlin, Germany"` in `listings`. I suspect it's a free-text field that hosts can fill in; the clean data appears to be in `neighbourhood_cleansed`.
- `neighbourhood_group` is (always?) missing in `listings`. The data seems to be stored in `neighbourhood_group_cleansed` instead.
- `price` is a formatted string with a currency symbol in `listings`, but a number in `listings_summary`.

Renamed the *neighbourhood* attributes in `listings_summary` to American English spelling:

```aql
FOR doc IN listings_summary
  REPLACE MERGE(
    UNSET(doc, ["neighbourhood_group", "neighbourhood"]),
    { neighborhood_group: doc.neighbourhood_group, neighborhood: doc.neighbourhood }
  ) IN listings_summary
```
### listings.csv

- Renamed `id` to `_key`
- Converted `amenities` and `host_verifications` to arrays of strings. Mapped `"None"` to `[]` for the latter, but we might want to drop the entire attribute if it has no value, or give it an explicit `null` value.

```aql
FOR doc IN listings
  LET amenities = JSON_PARSE(doc.amenities)
  LET host_verifications = doc.host_verifications == "None"
    ? []
    : JSON_PARSE(SUBSTITUTE(doc.host_verifications, "'", '"'))
  UPDATE doc WITH { amenities, host_verifications } IN listings
```
- `description` contains HTML markup; seen the following (possibly not exhaustive): `<br />`, `<p></p>`, `<b></b>`, `<a href="..."></a>`, `<body></body>`, `<html></html>`, as well as stray characters and entities like `♣`, `<`, `&boch;`, `’`
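The markup in `description` was only documented, not cleaned. If stripping it is ever desired, something like the following could work (a sketch, not applied to the dataset; it assumes that dropping all angle-bracket tags wholesale is acceptable):

```aql
FOR doc IN listings
  FILTER doc.description != null
  UPDATE doc WITH {
    // replace every <...> tag with a space, then trim the result
    description: TRIM(REGEX_REPLACE(doc.description, "<[^>]+>", " "))
  } IN listings
```

This would not decode HTML entities like `&boch;`; those would need a separate `SUBSTITUTE()` pass.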
- `host_about` contains `\r\n` line breaks
- `price` is always a string with a leading `$` (so the same currency throughout). Converted it to a number to enable range queries, but it could also be interesting to solve this at the View/Analyzer level.

```aql
FOR doc IN listings UPDATE doc WITH { price: TO_NUMBER(SUBSTRING(doc.price, 1)) } IN listings
```
The following query finds attributes holding `"t"`/`"f"` strings:

```aql
FOR doc IN listings
  FOR attr IN ATTRIBUTES(doc, true)
    FILTER doc[attr] IN ["t","f"]
    RETURN DISTINCT attr
```

Updated them to boolean values:

```aql
LET attrs = [
  "host_has_profile_pic",   // nullable
  "has_availability",
  "host_is_superhost",      // nullable
  "host_identity_verified", // nullable
  "instant_bookable"
]
FOR doc IN listings
  LET updated = MERGE(FOR attr IN attrs
    RETURN {[attr]: doc[attr] == "t" ? true : doc[attr] == "f" ? false : null }
  )
  UPDATE doc WITH updated IN listings
```
- Added `location` with a GeoPoint constructed from the `longitude` and `latitude` attributes (in this order!). Left the original attributes in, just in case.

```aql
FOR doc IN listings UPDATE doc WITH { location: GEO_POINT(doc.longitude, doc.latitude) } IN listings
```

- Renamed the *neighbourhood* attributes to American English. `ZIP()` is used to avoid adding attributes with `null` values in case an attribute wasn't present in the first place.

```aql
FOR doc IN listings
  LET attrs = ATTRIBUTES(doc, true)[* FILTER CONTAINS(CURRENT, "hood")]
  LET newAttrs = attrs[* RETURN SUBSTITUTE(CURRENT, "bour", "bor")]
  REPLACE MERGE(UNSET(doc, attrs), ZIP(newAttrs, attrs[* RETURN doc[CURRENT]])) IN listings
```
### reviews_summary.csv

- Renamed `id` to `_key`
- Converted `listing_id` from number to string

```aql
FOR doc IN reviews_summary UPDATE doc WITH { listing_id: TO_STRING(doc.listing_id) } IN reviews_summary
```
The collection contains `listing_id` and `date` only. Removed it.

### reviews.csv

- Renamed `id` to `_key`
- Converted `listing_id` from number to string

```aql
FOR doc IN reviews UPDATE doc WITH { listing_id: TO_STRING(doc.listing_id) } IN reviews
```
### calendar.csv.gz

- Unpacked as `calendar.csv`
- Converted `listing_id` from number to string

```aql
FOR doc IN calendar UPDATE doc WITH { listing_id: TO_STRING(doc.listing_id) } IN calendar
```

TODO: Map `listing_id` to `_key`? It would save us an extra index, but make it less obvious which other collection this field refers to.

- Converted `available` from string to boolean

```aql
FOR doc IN calendar UPDATE doc WITH { available: doc.available == "t" ? true : false } IN calendar
```
- Converted `price` and `adjusted_price` from the `$0.00` format into numbers.

```aql
FOR doc IN calendar
  FILTER LEFT(doc.adjusted_price, 1) == "$"
  UPDATE doc WITH { adjusted_price: TO_NUMBER(SUBSTRING(doc.adjusted_price, 1)) } IN calendar
```
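Only the `adjusted_price` query is shown above; presumably `price` was converted with the analogous query (a sketch following the same pattern):

```aql
FOR doc IN calendar
  FILTER LEFT(doc.price, 1) == "$"
  UPDATE doc WITH { price: TO_NUMBER(SUBSTRING(doc.price, 1)) } IN calendar
```

Note that, as discovered later for `listings`, `TO_NUMBER()` returns `0` for strings containing a thousands separator, so a `SUBSTITUTE(doc.price, ",", "")` step may be needed here as well.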
### neighbourhoods.csv

Can potentially be transformed into a graph: `neighbourhood_group` <-- `neighbourhood`

```aql
FOR doc IN neighbourhoods
  COLLECT group = doc.neighbourhood_group INTO neighbourhood = doc.neighbourhood
  RETURN {group, neighbourhood}
```

- Removed an empty document that was created because the last line in the source file only has a stray comma
- Removed the collection, because the GeoJSON source has the same data under `properties`
### neighbourhoods.geojson

```aql
LET geo = DOCUMENT("neighbourhoods_geojson/7874966") // key may vary
FOR feat IN geo.features
  INSERT feat INTO neighbourhoods_geojson_features
```

Some features have a `null` value for `neighbourhood` and `neighbourhood_group`: two small, adjacent areas to the north-east of Berlin. A watermark? Removed them.

```aql
FOR f IN neighbourhoods_geojson_features
  FILTER null IN [f.properties.neighbourhood, f.properties.neighbourhood_group]
  RETURN f
```
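The query above only returns the offending features; removing them would look like this (a sketch, swapping `RETURN` for `REMOVE`):

```aql
FOR f IN neighbourhoods_geojson_features
  FILTER null IN [f.properties.neighbourhood, f.properties.neighbourhood_group]
  REMOVE f IN neighbourhoods_geojson_features
```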
Created one vertex per neighborhood group:

```aql
FOR doc IN neighbourhoods
  COLLECT group = doc.neighbourhood_group
  LET g = {type: "group", name: group}
  INSERT g INTO neighborhoods
```

Created one vertex per neighborhood, merged with its GeoJSON feature, plus an edge to its group:

```aql
FOR doc IN neighbourhoods
  COLLECT group = doc.neighbourhood_group INTO neighbourhoods = doc.neighbourhood
  LET group_id = FIRST(FOR g IN neighborhoods FILTER g.name == group LIMIT 1 RETURN g._id)
  FOR neighborhood IN neighbourhoods
    LET feat = (FOR f IN neighbourhoods_geojson_features
      FILTER group == f.properties.neighbourhood_group AND
             neighborhood == f.properties.neighbourhood
      RETURN KEEP(f, ATTRIBUTES(f, true)))
    FILTER WARN(LENGTH(feat) == 1, "expected only one matching feature")
    LET n = MERGE({type: "neighborhood", name: neighborhood}, FIRST(feat))
    LET neighborhood_id = FIRST(INSERT n INTO neighborhoods RETURN NEW._id)
    INSERT { _from: neighborhood_id, _to: group_id } INTO neighborhood_edges
```
Renamed the attributes under `properties` in `neighborhoods` (American English):

```aql
FOR doc IN neighborhoods
  UPDATE doc WITH {
    properties: {
      neighborhood: doc.properties.neighbourhood,
      neighborhood_group: doc.properties.neighbourhood_group
    }
  } IN neighborhoods OPTIONS { mergeObjects: false }
```
We should add indexes for the following fields (collections), aside from what we will do with ArangoSearch:

- `listing_id` (reviews, reviews_summary, calendar): persistent index
- `host_id` (listings, listings_summary): persistent index
- `latitude`, `longitude` (listings): geo index
- `amenities` (listings): persistent array index

Is the conversion of `ß` to `ss` with the German text Analyzer part of the accent removal?
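One way to check this empirically (a sketch; it assumes the built-in `text_de` Analyzer is the one we end up using):

```aql
// inspect how the German text Analyzer normalizes ß
RETURN TOKENS("Straße", "text_de")
```

If the returned token list contains `strasse`, the folding happens as part of the Analyzer's normalization; if it contains `straße`, it does not and would have to be handled separately.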
Notes about making the dataset available for others:

```sh
./build/bin/arangodump --server.endpoint tcp://127.0.0.1:9929 --server.database arangobnb --server.authentication false --threads 5 --collection listings --collection reviews --collection neighborhoods --collection calendar --collection _analyzers --include-system-collections dump_2021-03-12
./build/bin/arangorestore --server.endpoint tcp://127.0.0.1:9929 --server.database arangobnb_backup --create-database --server.authentication false --threads 5 --include-system-collections dump_2021-xx-xx
```

Dumping remotely (arangodump running locally, arangod remotely) seems to be really slow. Dumping on the same machine as the server process, on the other hand, only takes a minute.
We will need to change how the `latitude` and `longitude` values are being stored. They will need to be either GeoJSON, an array, or stored as sub-properties.

https://www.arangodb.com/docs/devel/arangosearch-analyzers.html#geojson

I think we have enough information to make descriptive GeoJSON objects for the listings:

```js
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    // GeoJSON mandates [longitude, latitude] order
    "coordinates": [listing.longitude, listing.latitude]
  },
  "properties": {
    "name": listing.name
  }
}
```
What do you think?
Good point, we should make it a GeoJSON Point, which moves the coordinates to a sub-attribute. Not sure if there's a benefit in moving the name to the GeoJSON `properties` field. Maybe outside of ArangoDB, when exported?

I don't know that there is a huge benefit to putting the name in `properties`. Perhaps if we were creating a separate collection of just GeoJSON objects, but otherwise I don't think it will make a difference.
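For reference, the earlier `GEO_POINT()` update already yields exactly this Point shape; an explicit, equivalent sketch without the helper function would be:

```aql
FOR doc IN listings
  UPDATE doc WITH {
    // GeoJSON Point: coordinates in [longitude, latitude] order
    location: { type: "Point", coordinates: [doc.longitude, doc.latitude] }
  } IN listings
```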
Found that prices above `999` were not being included: `TO_NUMBER()` doesn't consider a number with a `,` valid and thus turned `1,000` and up into `0`. Updated the query to strip the `,`, and now they are included:

```aql
FOR doc IN listings UPDATE doc WITH { price: TO_NUMBER(SUBSTRING(SUBSTITUTE(doc.price, ",", ""), 1)) } IN listings
```
The current dataset we intend to use is here: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data?select=listings_summary.csv

This may need to change if we find it isn't able to fulfill some of the features we require for the project. However, the initial investigation looks promising.

We will need to model the data, and once that is done we will update the README to contain the working dataset.

Feel free to contribute to this as well if you have a suggested/preferred approach.