calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Clean Amazon URLs scraped before #100 in GreenDB #123

Open BigDatalex opened 1 year ago

BigDatalex commented 1 year ago

As part of PR #100, the logic for storing Amazon URLs was improved so that unnecessary tracking information is no longer stored. This is done by the following function: https://github.com/calgo-lab/green-db/blob/7ab12c99ef0a6360540dbcc7129fa757f6235312/scraping/scraping/utils.py#L15-L22

However, the URLs of products stored in the DB before that change are not affected and still include the tracking information. This interferes with the deduplication and publishing process of the GreenDB, because that process relies on the URLs. We need to clean the URLs of the Amazon products in the DB that still include the tracking information.
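
A rough sketch of what such a cleanup could look like, to make the idea concrete. This is not the function from `utils.py` linked above: `strip_tracking` and `find_urls_to_clean` are hypothetical names, and the rule used here (drop the query string, the fragment, and any `/ref=...` path suffix) is only an assumption about what counts as tracking information. The actual migration should reuse the function introduced in PR #100 so that old and new rows end up in the same form.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_tracking(url: str) -> str:
    """Hypothetical cleaner, NOT the function from PR #100.

    Assumes tracking data lives in the query string, the fragment,
    and an optional '/ref=...' suffix of the path.
    """
    parts = urlsplit(url)
    path = parts.path
    ref_index = path.find("/ref=")
    if ref_index != -1:
        # Cut the path right before the '/ref=' segment.
        path = path[:ref_index]
    # Rebuild the URL without query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def find_urls_to_clean(urls):
    """Return (old_url, cleaned_url) pairs for URLs that would change."""
    changes = []
    for url in urls:
        cleaned = strip_tracking(url)
        if cleaned != url:
            changes.append((url, cleaned))
    return changes
```

The pairs returned by `find_urls_to_clean` could then drive an update of the affected rows. Returning only the URLs that actually change keeps the update limited to products scraped before PR #100, and running the cleaner a second time on its own output yields no further changes.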