datasette / datasette-extract

Import unstructured data (text and images) into structured tables
Apache License 2.0

Extraction often only gets some of the data #12

Open simonw opened 6 months ago

simonw commented 6 months ago

I'm testing with data from this page: https://ogs.ny.gov/procurement/ogs-centralized-contracts-list

I pasted in this:

Award # Group   Award Title Type    Keywords
23295   20915   Furniture, All Types (Except Hospital Room and Patient Handling) (Statewide)    Commodity   Conference Furniture, Dormitory Furniture, General Purpose Tables, High Density Filing, Household Furniture, Library Furniture, Office Furniture, School Furniture, Specialty Seating, Tall Seating, Bariatric, Gang Seating, Laboratory Stools, Systems Furniture
23287   05500   Fuel Oil, Heating (Grades #2, #6 Kerosene and Bioheating Fuel) (Statewide)  Commodity   Fuel oil, Heating, Kerosene, Bioheating Fuel, Heating
23321   05900   Natural Gas (Firm Supply - Specific Locations Within National Grid Territories) Commodity   Natural Gas
23283   05800   Liquefied Petroleum Gas (LPG) - Propane (Statewide) Commodity   Cylinders, Gallons, Tanks, Installation, Testing, Inspections, LP, Liquid Propane, Butane, Isobutene
23315   01800   Road Salt, Treated Salt, & Emergency Standby Road Salt (Statewide)  Commodity   Ice, Snow, Sodium, Chloride
23272   50030   Wove & Kraft Envelopes  Commodity   Printed Envelopes, Non-Printed Envelopes
23254   40524   School Buses (Statewide)    Commodity   Bus, Conventional Bus
23241   10201   Pharmaceuticals (Individual Prescriptions) Statewide & Regional Commodity   Drugs, Pharmacists Services, Prescription Delivery, Over the Counter, OTC, Pharmaceutical Products, Medicine, Medication
23200   20600   Floor Coverings and Related Services (Statewide Piggyback)  Commodity   Carpet, Tile, Broadloom, Vinyl, LVT, Rubber Tile, Hardwood, Linoleum, Floormat, Ceramic, Installation, Padding
23238   79006   Air Travel Services (Statewide) Commodity   Plane Travel
23222   10150   Personal Protective Equipment (PPE) and Related Items (Statewide)   Commodity   Respirators, Masks, Face Shields, Goggles, Gowns, Covers, Hand Sanitizer, Wipes, Fit Test Kits, N95, Disinfecting Wipes, Surgical Mask, Alcohol Wipes, PPE
23239   01600   Milk, Fluid (Statewide) Commodity   Low Fat Milk, Reduced Fat Milk, Skim Milk, Homogenized Milk
23073   30204   Athletic Equipment (Statewide)  Commodity   Gymnasium Equipment, Physical Education Equipment, Fitness, Exercise, Elliptical, Bike, Barbell, Dumbbell, Bench, Cardiovascular, Strength Training, Stairclimber, Treadmill, Weights, Mats
23123   30310   Vehicle and Equipment Parts and Related Product (Statewide) Commodity   Light Duty Vehicle Parts, Heavy Duty Vehicle Parts, Heavy Equipment Parts, Direct Order Parts, Commonly Stocked Parts, Vehicle Cleaning Supplies, Vehicle Paint, Vehicle Tools
23204   05700   Motor Oil, Hydraulic Oil, and Diesel Exhaust Fluid (Statewide)  (Replaces 23012-RA, SW) Commodity   Motor Crankcase Oil, Hydraulic Oil, Diesel Exhaust Fluid, Refined Oil, Re-Refined Oil, Lubricating Oil, High Detergent, 5W-30, 5W-20, 10W-30, 15W-40
23149   30600   Tires, Tubes, and Services (Statewide)  Commodity    
PGB-23243   35000   Vehicle Lifts and Associated Garage Equipment Sourcewell Piggyback (Statewide)  Commodity   Garage Associated Parts, Garage Associated Supplies, Garage Associated Accessories, Vehicle Lift Installation, Vehicle Lift Repair, Vehicle Lift Maintenance
23166   40440   Vehicles, Class 1 – 8 (Statewide)   Commodity   Single OEM Vehicles, Chassis, Complete Vehicles, Car, Truck, SUV, Van, Sedan, One-ton Truck, Cargo Van
23170   40523   Buses, Transit (Adult Passenger) (Statewide)    Commodity   FTA Adult Passenger Transit Buses, Associated Transit Bus Equipment
23260   20070   Books, Serials, Databases, and Library Resource Management Products Commodity   Serials, Databases, Library Resource Management Products and Services, Printed Publications, Non-Print Library Materials, Electronic Publications, Research Support Products, Printed Periodicals, Electronic Periodicals, eBooks, Streaming Audio, Video Content, Magazines, Newspapers, Journals, Legal Research, Books, Textbooks
23185   23106   STEM / STEAM, Science Laboratory Educational Supplies And Equipment (Statewide) Commodity    
23268   32100   Snow and Ice Control Agents (Statewide) Commodity   Liquid, Calcium, Chloride, Organic, Based Performance Enhancer, OBPE, Magnesium, Corrosion, Inhibited, Treated, Salt, Flake, Pellet, Solar
23054   40061   Protective Outerwear    Commodity   Safety Shoes, Specialty Boots, Firefighting Turnout Gear, Firefighting Proximity Clothing, Wildland Clothing, Helmets, Gloves, Firefighting Boots, EMS/Search and Rescue Clothing, Bunker Gear
PGB-23197   10200   Distribution of Vaccines Including Influenza Vaccines (Statewide) (Replaces Award PGB-22797)    Commodity   General Vaccines, Seasonal Influenza Vaccines, Pharmaceuticals, Drugs, Flu Shot, COVID-19 Vaccine
PGB-23290   10200   MMCAP Infuse Influenza Vaccines (Statewide) Commodity   General Vaccines, Seasonal Influenza Vaccines, Pharmaceuticals, Drugs, Flu Shot

Configured like this:

[Screenshot of the extraction configuration: CleanShot 2024-03-07 at 10 57 00@2x]

But it only extracted seven records, and the seventh mixes fields from two different rows:

[
  {
    "award_number": 23295,
    "group": 20915,
    "award_title": "Furniture, All Types (Except Hospital Room and Patient Handling) (Statewide)",
    "type": "Commodity",
    "keywords": "Conference Furniture, Dormitory Furniture, General Purpose Tables, High Density Filing, Household Furniture, Library Furniture, Office Furniture, School Furniture, Specialty Seating, Tall Seating, Bariatric, Gang Seating, Laboratory Stools, Systems Furniture"
  },
  {
    "award_number": 23287,
    "group": 5500,
    "award_title": "Fuel Oil, Heating (Grades #2, #6 Kerosene and Bioheating Fuel) (Statewide)",
    "type": "Commodity",
    "keywords": "Fuel oil, Heating, Kerosene, Bioheating Fuel, Heating"
  },
  {
    "award_number": 23321,
    "group": 5900,
    "award_title": "Natural Gas (Firm Supply - Specific Locations Within National Grid Territories)",
    "type": "Commodity",
    "keywords": "Natural Gas"
  },
  {
    "award_number": 23283,
    "group": 5800,
    "award_title": "Liquefied Petroleum Gas (LPG) - Propane (Statewide)",
    "type": "Commodity",
    "keywords": "Cylinders, Gallons, Tanks, Installation, Testing, Inspections, LP, Liquid Propane, Butane, Isobutene"
  },
  {
    "award_number": 23315,
    "group": 1800,
    "award_title": "Road Salt, Treated Salt, & Emergency Standby Road Salt (Statewide)",
    "type": "Commodity",
    "keywords": "Ice, Snow, Sodium, Chloride"
  },
  {
    "award_number": 23272,
    "group": 50030,
    "award_title": "Wove & Kraft Envelopes",
    "type": "Commodity",
    "keywords": "Printed Envelopes, Non-Printed Envelopes"
  },
  {
    "award_number": 23254,
    "group...,": 23239,
    "award_title": "Milk, Fluid (Statewide)",
    "type": "Commodity",
    "keywords": "Low Fat Milk, Reduced Fat Milk, Skim Milk, Homogenized Milk"
  }
]
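
The last record in the output above carries a key ("group...,") that does not match the other records. A check for that kind of malformed record could look roughly like the sketch below. The expected key set is taken from the JSON shown in this issue; `find_malformed_records` is a hypothetical helper, not part of datasette-extract.

```python
# Column names as they appear in the extracted JSON in this issue.
EXPECTED_KEYS = {"award_number", "group", "award_title", "type", "keywords"}


def find_malformed_records(records: list[dict]) -> list[int]:
    """Return the indexes of records whose keys differ from EXPECTED_KEYS."""
    return [i for i, rec in enumerate(records) if set(rec) != EXPECTED_KEYS]


# Two records copied from the output above: one clean, one with the odd key.
records = [
    {"award_number": 23272, "group": 50030,
     "award_title": "Wove & Kraft Envelopes", "type": "Commodity",
     "keywords": "Printed Envelopes, Non-Printed Envelopes"},
    {"award_number": 23254, "group...,": 23239,
     "award_title": "Milk, Fluid (Statewide)", "type": "Commodity",
     "keywords": "Low Fat Milk, Reduced Fat Milk, Skim Milk, Homogenized Milk"},
]

print(find_malformed_records(records))  # [1]
```

A check like this only flags records with unexpected keys; it cannot tell that the values themselves came from the wrong row.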

simonw commented 6 months ago

First suspicion: the output is being cut off by some default limit on the number of output tokens.

So I added max_tokens=4096, since a few quick searches suggested that was the maximum.

That got me 9 records instead of 7. When I pasted the output JSON into a token counter it came to only 888 tokens, so nowhere near the limit.
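
One way to tell "hit the token limit" apart from "the model stopped on its own" is to look at the finish reason on the response. The sketch below assumes an OpenAI-style chat completion payload already parsed into a dict, where a choice that ran into max_tokens reports a finish_reason of "length". The issue does not say what response shape the plugin sees, so treat that as an assumption; `was_truncated` is a hypothetical helper.

```python
def was_truncated(response: dict) -> bool:
    """Return True if any choice stopped because it hit the output token limit.

    Assumes an OpenAI-style payload: {"choices": [{"finish_reason": ...}, ...]}.
    """
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )


# Hypothetical payloads for illustration only, not real API output.
stopped_on_its_own = {"choices": [{"finish_reason": "stop"}]}
hit_the_limit = {"choices": [{"finish_reason": "length"}]}

print(was_truncated(stopped_on_its_own))  # False
print(was_truncated(hit_the_limit))       # True
```

If this returned False for an 888-token output, that would be consistent with the model ending its answer early rather than being cut off.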

simonw commented 6 months ago

This may need to be solved with documentation: a note on the page warning that extraction will not necessarily capture everything.

This tool is going to need quite a bit of inline documentation to help people deal with its limitations.
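
A warning like that could also be backed by a rough completeness check. The sketch below assumes the pasted input has one record per non-empty line after a single header row, which holds for the table in this issue but not for unstructured text in general. `count_input_rows` and `completeness_warning` are hypothetical names, not part of datasette-extract.

```python
import json
from typing import Optional


def count_input_rows(pasted: str, has_header: bool = True) -> int:
    """Count non-empty lines in the pasted text, minus an optional header row."""
    lines = [line for line in pasted.splitlines() if line.strip()]
    return max(len(lines) - (1 if has_header else 0), 0)


def completeness_warning(pasted: str, extracted_json: str) -> Optional[str]:
    """Return a warning if fewer records came back than rows went in, else None."""
    expected = count_input_rows(pasted)
    got = len(json.loads(extracted_json))
    if got < expected:
        return f"Extracted {got} of roughly {expected} rows; some data may be missing."
    return None


# Shortened, made-up input for illustration: a header plus three rows.
pasted = (
    "Award # Group Award Title\n"
    "23295 20915 Furniture\n"
    "23287 05500 Fuel Oil\n"
    "23321 05900 Natural Gas\n"
)
extracted = '[{"award_number": 23295}, {"award_number": 23287}]'

print(completeness_warning(pasted, extracted))
# Extracted 2 of roughly 3 rows; some data may be missing.
```

The line count is only a heuristic, which is why the message says "roughly"; for free-form text or images there is no reliable expected count to compare against.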