BigDatalex opened this issue 2 years ago
Using `timestamp`, `url` and `category` to join the `scraping` table with the `green-db` table does not work for asos, because the `url` in the `scraping` table is different from the `url` in the `green-db` table.

In the `scraping` table we store the `url` of the asos API from which we retrieve the product data:
https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/scraping/scraping/spiders/_base.py#L205
and in the `green-db` table we store the `url` of the product's website:
https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/extract/extract/extractors/asos.py#L50
A workaround for asos to join both tables is to extract the product id from the API `url` (the `url` stored in the `scraping` table) and from the website `url` (the `url` stored in the `green-db` table), and to join based on this product id, `timestamp` and `category`.
For example, this code does the job:

```python
# take the last path segment of the API url and strip the query parameters
scraping_asos["product_id"] = scraping_asos["url"].apply(lambda x: x.split("/")[-1].split("?")[0])
# the website url ends directly with the product id
greendb_asos["product_id"] = greendb_asos["url"].apply(lambda x: x.split("/")[-1])
```
I'm not sure if this is something we want to implement...
I could think of maintaining a "forward dependency", like a `created` column in the `scraping` database that has a foreign key to the row of the extracted product (`green-db` database). However, what if we manually run the extraction again? Overwrite, update, or extend (add an int to an array) this dependency?
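A minimal sketch of what I mean, using SQLAlchemy like the rest of our database code; the table and column names here are made up and do not match the actual schema in `database/database/tables.py`:

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class GreenDb(Base):
    __tablename__ = "green_db"  # illustrative name
    id = Column(Integer, primary_key=True)  # currently autogenerated

class Scraping(Base):
    __tablename__ = "scraping"  # illustrative name
    id = Column(Integer, primary_key=True)
    html = Column(String)
    # the "forward dependency": filled in once the row has been extracted;
    # unclear what should happen here on a second extraction run
    extracted_id = Column(Integer, ForeignKey("green_db.id"), nullable=True)
```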
Why not query (SQL) for the rows of interest in the scraping database and, if necessary, extract the required information from the HTML? I'm assuming the overhead is not the bottleneck here.
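Roughly like this, with a placeholder connection string and table name (the re-extraction step would then run over the `html` column):

```python
import pandas as pd
from sqlalchemy import create_engine

# placeholder DSN and table name, not our real deployment
engine = create_engine("postgresql://user:password@localhost:5432/scraping")

# pull the raw rows of interest straight from the scraping database
rows = pd.read_sql(
    "SELECT id, timestamp, category, url, html FROM asos WHERE category = 'SHOES'",
    engine,
)
```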
Ok, I see... if we want to keep the option to run another extraction, this wouldn't work.
Then the best option might be to create an additional mapping table that maps the `id` of the `scraping` table to the `id` in the `green-db` table (both being foreign keys to their respective tables). This would not affect our existing table structure at all, but it would keep track of the corresponding rows and allow for multiple extraction runs.
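A rough sketch of such a mapping table, again with SQLAlchemy and made-up names (the real per-shop tables live in `database/database/tables.py`):

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Scraping(Base):
    __tablename__ = "scraping"  # illustrative stub
    id = Column(Integer, primary_key=True)

class GreenDb(Base):
    __tablename__ = "green_db"  # illustrative stub
    id = Column(Integer, primary_key=True)

class ScrapingGreenDbMapping(Base):
    __tablename__ = "scraping_green_db_mapping"
    id = Column(Integer, primary_key=True)
    # both columns are foreign keys into the existing tables, so neither
    # existing table changes at all
    scraping_id = Column(Integer, ForeignKey("scraping.id"), nullable=False)
    green_db_id = Column(Integer, ForeignKey("green_db.id"), nullable=False)
```

Repeated extraction runs would then simply add new mapping rows for the same `scraping_id`.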
And regarding running the extraction again when someone wants to use the HTML: I think this is not very user-friendly, and for the older data it is also not easily doable, because our extractor code is not backwards compatible, so we cannot extract the information from old HTMLs using the current extractor implementation.
I would really appreciate such a feature, and so would probably everyone else who wants to use the HTML in combination with the extracted data at some point! :)
Currently it is not possible to relate information of the `scraping` table to its corresponding extracted product information in the `green-db` table via `id`. If we want to join the tables, we currently have to use `timestamp`, `url` and `category`.
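To illustrate, the join currently looks roughly like this (pandas, with made-up toy data):

```python
import pandas as pd

# toy stand-ins for rows loaded from the scraping and green-db tables
scraping_df = pd.DataFrame({
    "timestamp": ["2022-01-01"],
    "url": ["https://www.example.com/product/1"],
    "category": ["SHOES"],
    "html": ["<html>...</html>"],
})
greendb_df = pd.DataFrame({
    "timestamp": ["2022-01-01"],
    "url": ["https://www.example.com/product/1"],
    "category": ["SHOES"],
    "name": ["Sneaker"],
})

# the three-column join that is currently necessary
joined = scraping_df.merge(greendb_df, on=["timestamp", "url", "category"])
```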
We already use the `id` to retrieve a specific row in the `scraping` table, but the `id` is not used any further when writing the extracted product information into the `green-db`, see: https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/workers/workers/extract.py#L36-L39
The `green-db` table already has an `id` column, but this is autogenerated, see: https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/database/database/tables.py#L203

So, integrating this shouldn't be a lot of work and would help whenever we want to use information from the `scraping` table together with the `green-db` table, for example using the HTML together with the extracted product information for some ML.
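To sketch what this would enable (the `scraping_id` column and the toy data below are made up, not the current schema), the join would collapse to a single key:

```python
import pandas as pd

# toy data: a scraped page and the product extracted from it
scraping_df = pd.DataFrame({"id": [1], "html": ["<html>...</html>"]})
greendb_df = pd.DataFrame({"id": [10], "scraping_id": [1], "name": ["Sneaker"]})

# e.g. pair the raw HTML with the extracted product information for ML
ml_data = greendb_df.merge(
    scraping_df,
    left_on="scraping_id",
    right_on="id",
    suffixes=("_greendb", "_scraping"),
)
```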