calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

green-db table can not be joined with scraping table based on id #78

Open BigDatalex opened 2 years ago

BigDatalex commented 2 years ago

Currently it is not possible to relate information in the scraping table to its corresponding extracted product information in the green-db table via id. If we want to join the tables, we currently have to use timestamp, url, and category.

We already use the id to retrieve a specific row in the scraping table, but the id is not used any further when writing the extracted product information into the green-db; see: https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/workers/workers/extract.py#L36-L39

The green-db table already has an id column, but it is autogenerated; see: https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/database/database/tables.py#L203

So, integrating this shouldn't be a lot of work and would help whenever we want to use information from the scraping table together with the green-db table, for example, using the HTML together with the extracted product information for some ML.
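
A minimal sketch of the idea; the class and function names here are illustrative assumptions, not the actual green-db code:

```python
from dataclasses import dataclass

# Illustrative sketch only; names are assumptions, not the green-db schema.
@dataclass
class ExtractedProduct:
    url: str
    name: str
    scraping_id: int  # proposed: carry the scraping row's id into the green-db row

def store_extracted_product(scraping_id: int, url: str, name: str) -> ExtractedProduct:
    # Instead of dropping the id after fetching the scraping row,
    # keep it on the extracted product so both tables can be joined on it directly.
    return ExtractedProduct(url=url, name=name, scraping_id=scraping_id)
```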

BigDatalex commented 2 years ago

Using timestamp, url, and category to join the scraping table with the green-db table does not work for asos, because the url in the scraping table is different from the url in the green-db table.

In the scraping table we store the url of the asos API from which we retrieve the product data: https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/scraping/scraping/spiders/_base.py#L205

and in the green-db table we store the url of the product's website: https://github.com/calgo-lab/green-db/blob/90b631bf81b7408d496534bd75d142e7c563c84d/extract/extract/extractors/asos.py#L50

BigDatalex commented 2 years ago

A workaround for asos to join both tables is to extract the product id from the API url (the url stored in the scraping table) and the website url (the url stored in the green-db table), and to join based on this product id, timestamp, and category.

For example, this code does the job:

```python
# Strip the trailing path segment (and any query string) to derive the product id
scraping_asos["product_id"] = scraping_asos["url"].apply(lambda x: x.split("/")[-1].split("?")[0])
greendb_asos["product_id"] = greendb_asos["url"].apply(lambda x: x.split("/")[-1])
```
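
The resulting join could then look like the following minimal sketch, assuming both sides are pandas DataFrames named as above:

```python
# Hypothetical join on the derived product id plus the shared columns;
# the suffixes keep the two differing url columns apart.
joined = scraping_asos.merge(
    greendb_asos,
    on=["product_id", "timestamp", "category"],
    suffixes=("_scraping", "_greendb"),
)
```
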
se-jaeger commented 2 years ago

I'm not sure if this is something we want to implement...

I could think of maintaining a "forward dependency", e.g., a column in the scraping database that holds a foreign key to the row of the extracted product (green-db database). However, what if we manually run the extraction again? Should we overwrite, update, or extend (add an int to an array) this dependency?

Why not query (SQL) for the rows of interest in the scraping database and, if necessary, extract the required information from the HTML? I'm assuming the overhead is not the bottleneck here.
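
A minimal sketch of that query path in Python with SQLAlchemy; the connection string, table name, and column names are assumptions, not the actual green-db schema:

```python
import sqlalchemy as sa

# Hypothetical connection and schema; adjust to the real scraping database.
engine = sa.create_engine("postgresql://user:password@localhost:5432/scraping")

query = sa.text(
    """
    SELECT id, html
    FROM scraping
    WHERE timestamp = :timestamp AND category = :category
    """
)

with engine.connect() as connection:
    rows = connection.execute(
        query, {"timestamp": "2022-01-01 00:00:00", "category": "SHIRT"}
    ).fetchall()
```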

BigDatalex commented 2 years ago

Ok, I see. If we want to keep the option to run another extraction, this wouldn't work.

Then the best option might be to create an additional mapping table that maps the id of the scraping table to the id in the green-db table (both being foreign keys to their respective tables). This would not affect our existing table structure at all, but it would keep track of the corresponding rows and allow for multiple extraction runs.
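
A minimal sketch of such a mapping table in SQLAlchemy; the table and column names are assumptions, not the actual green-db schema:

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ScrapingToGreenDb(Base):
    # Hypothetical mapping table: one row per (scraping row, extraction run).
    __tablename__ = "scraping_to_green_db"

    id = Column(Integer, primary_key=True, autoincrement=True)
    scraping_id = Column(Integer, ForeignKey("scraping.id"), nullable=False)
    green_db_id = Column(Integer, ForeignKey("green_db.id"), nullable=False)
```

Since nothing forces scraping_id to be unique here, repeated extraction runs simply add rows, which sidesteps the overwrite/update/extend question above.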

And regarding re-running the extraction when someone wants to use the HTML: I think this is not very user-friendly, and for older data it is also not easily doable, because our extractor code is not backwards compatible, so we cannot extract the information from old HTML using the current extractor implementation.

I would really appreciate such a feature, and probably everyone else who wants to use the HTML in combination with the extracted data at some point would too! :)