Boavizta / boaviztapi

🛠 Giving access to BOAVIZTA reference data and methodologies through a RESTful API
GNU Affero General Public License v3.0

INTEL parser #132

Closed da-ekchajzer closed 9 months ago

da-ekchajzer commented 1 year ago

Intel data could be exported from: https://ark.intel.com/content/www/us/en/ark.html

csauge commented 1 year ago

It seems we would need to scrape the Intel website, which does not look easy. Alternatively, there are these possible sources:

da-ekchajzer commented 1 year ago

Nice datasets!

csauge commented 1 year ago

I tried scraping the TechPowerUp site and it seems to work. We need to parse all 3000 (AMD and Intel) processor pages to get this level of detail, but that is easy. I think a script of about 50 lines of code could extract a CSV with all the CPUs. The same scraping is also possible for GPUs and SSDs. However, I do not know whether the list of CPUs is exhaustive.

Here is an example for one CPU: https://www.techpowerup.com/cpu-specs/ryzen-5-3600.c2132

Header: ['Socket', 'Foundry', 'Process Size', 'Transistors', 'Die Size', 'I/O Process Size', 'I/O Die Size', 'Package', 'tCaseMax', 'Market', 'Production Status', 'Release Date', 'Launch Price', 'Part#', 'Bundled Cooler', 'Frequency', 'Turbo Clock', 'Base Clock', 'Multiplier', 'Multiplier Unlocked', 'TDP', 'FP32', 'Codename', 'Generation', 'Memory Support', 'ECC Memory', 'PCI-Express', 'Chipsets', '# of Cores', '# of Threads', 'SMP # CPUs', 'Integrated Graphics', 'Cache L1', 'Cache L2', 'Cache L3', 'Features']

Value: ['AMD Socket AM4', 'TSMC', '7 nm', '3,800 million', '74 mm²', '12 nm', '124 mm²', 'µOPGA-1331', '95°C', 'Desktop', 'Active', 'Jul 7th, 2019', '$199', '100-000000031', 'Wraith Stealth', '3.6 GHz', 'up to 4.2 GHz', '100 MHz', '36.0x', 'Yes', '65 W', '1,209.6 GFLOPS', 'Matisse', 'Ryzen 5 (Zen 2 (Matisse))', 'DDR4-3200 MHz Dual-channel', 'No', 'Gen 4, 16 Lanes(CPU only)', 'AMD 300 Series, AMD 400 Series, AMD 500 Series', '6', '12', '1', 'N/A', '64K (per core)', '512K (per core)', '32MB (shared)', 'MMX SSE SSE2 SSE3 SSSE3 SSE4A SSE4.1 SSE4.2 AES AVX AVX2 BMI1 BMI2 SHA F16C FMA3 AMD64 EVP AMD-V SMAP SMEP SMT Precision Boost 2']
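To illustrate how a scraped `Header`/`Value` pair like the one above could become one CSV row, here is a minimal sketch using only the Python standard library. The `rows_to_csv` helper is hypothetical (it is not part of boaviztapi), and the lists are a shortened subset of the Ryzen 5 3600 example.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of {field: value} dicts to CSV text.

    Columns are the union of all keys, in first-seen order, so pages
    that lack some fields still fit in the same file (empty cells).
    """
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Shortened subset of the fields scraped for the Ryzen 5 3600 page.
header = ['Socket', 'Foundry', 'Process Size', 'TDP', '# of Cores', '# of Threads']
value = ['AMD Socket AM4', 'TSMC', '7 nm', '65 W', '6', '12']

# Pair each field name with its value, then serialize as one CSV row.
record = dict(zip(header, value))
csv_text = rows_to_csv([record])
print(csv_text)
```

Taking the union of keys as the column set is one way to cope with pages that do not all expose the same fields. Whether that actually happens on TechPowerUp is an assumption.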

da-ekchajzer commented 1 year ago

I don't know if it is exhaustive, but it is definitely more complete than our current dataset. If we could scrape all those processor characteristics, it would be wonderful.

The only data manipulation needed is extracting the CPU family, here zen 2 from Ryzen 5 (Zen 2 (Matisse)). The list of families can be found here: https://github.com/Boavizta/boaviztapi/blob/v0.3/boaviztapi/data/crowdsourcing/cpu_manufacture.csv
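A possible way to do that extraction, as a rough sketch: take the text of the first parenthesized group in the `Generation` field, up to any nested parenthesis, and lowercase it. The `extract_family` function is hypothetical, and it assumes every `Generation` value follows the same `Name (Family (Codename))` pattern as the Ryzen example, which has not been checked against other pages.

```python
import re

# First parenthesized group, stopping at a nested "(" or the closing ")".
_FAMILY_PATTERN = re.compile(r"\(([^()]+?)\s*(?:\(|\))")


def extract_family(generation):
    """Return the lowercased CPU family from a 'Generation' string, or None."""
    match = _FAMILY_PATTERN.search(generation)
    if match is None:
        return None
    return match.group(1).strip().lower()


family = extract_family('Ryzen 5 (Zen 2 (Matisse))')
print(family)
```

The result would still need to be matched against the family names in `cpu_manufacture.csv`, since the spelling used there may differ from the scraped one.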

What are you using to scrape the pages? Maybe we could do it in a notebook in the repo to make the scraping reproducible.

I also see that they have GPU (https://www.techpowerup.com/gpu-specs/) and SSD (https://www.techpowerup.com/ssd-specs/) specs. It will be very helpful in the future.

We should be aware that TechPowerUp might block our scraping (see https://www.reddit.com/r/datasets/comments/y6isgi/scraping_gpus_from_techpowerupcom_using_python/).

csauge commented 1 year ago

I am currently using BeautifulSoup to scrape, with rotating IPs, changing HTTP headers, and a random interval between requests. It works OK, but I hit a captcha every 50 requests... I think I need a scraping expert to do this faster... I can provide a CSV after that.
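For readers who want to reproduce the throttling part, here is a minimal standard-library sketch of a random interval between requests and rotating headers. It is not the script used above: `polite_fetch_all`, the `USER_AGENTS` values, and the delay bounds are all hypothetical, and IP rotation and captcha handling are left out.

```python
import itertools
import random
import time

# Hypothetical placeholder values; real browser strings would go here.
USER_AGENTS = [
    "example-agent/1.0",
    "example-agent/2.0",
    "example-agent/3.0",
]


def polite_fetch_all(urls, fetch, min_delay=2.0, max_delay=6.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch(url, headers)` is supplied by the caller and returns the page body.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    agents = itertools.cycle(USER_AGENTS)
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        # Rotate the User-Agent header on every request.
        headers = {"User-Agent": next(agents)}
        pages[url] = fetch(url, headers)
    return pages
```

The `fetch` callable could wrap `urllib.request.urlopen(urllib.request.Request(url, headers=headers))`, with the returned HTML then passed to BeautifulSoup. Keeping the network call outside the helper makes the pacing logic easy to test offline.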

da-ekchajzer commented 1 year ago

@AirLoren any thoughts from our scraping expert?

csauge commented 1 year ago

This is fixed by https://github.com/Boavizta/boaviztapi/pull/170 and https://github.com/Boavizta/boaviztapi-utils/commit/ac0ca06fa22d9ccee37fd2a29050821e8a1068c9. The ticket can be closed.