censys snapshot is in trouble with generate_dataset.py

Hello,

Thanks for your question! For our data, we unnested the services field before exporting, so each exported row describes a single service rather than a whole host. Here is the query we used on BigQuery:

SELECT
  host_identifier,
  ipv4_int,
  ipv6_int,
  port,
  transport,
  service_name,
  extended_service_name,
  banner,
  software,
  perspective,
  source_ip,
  truncated,
  observed_at,
  snapshot_date
FROM
  `censys-io.research_1q.universal_internet_dataset`,
  UNNEST(services)
WHERE
  DATE(snapshot_date) = "YYYY-MM-DD";

If your data is in a different format, I believe the best solution is to first convert it to JSONL files in the above format, that is, one JSON object per service with the columns listed in the query. You can then feed those files into generate_dataset.py.
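As a rough sketch of that conversion, assuming your snapshot is a JSONL file with one host per line and a nested services array (your actual layout may differ), something like the following would mimic the UNNEST step. The function names unnest_services and convert are hypothetical helpers, not part of this repository, and the column list is copied from the query above. Fields missing from a record are written as null, and a service-level field overrides a host-level field of the same name.

```python
import json

# Columns selected by the BigQuery query above.
COLUMNS = [
    "host_identifier", "ipv4_int", "ipv6_int", "port", "transport",
    "service_name", "extended_service_name", "banner", "software",
    "perspective", "source_ip", "truncated", "observed_at", "snapshot_date",
]


def unnest_services(host):
    """Yield one flat row per entry in host["services"].

    Assumes `host` is a dict with a nested "services" list (an assumption
    about the input format). Hosts with no services yield no rows, which
    matches the cross join with UNNEST in the query.
    """
    host_fields = {k: v for k, v in host.items() if k != "services"}
    for service in host.get("services") or []:
        merged = {**host_fields, **service}
        # Keep only the expected columns, in a fixed order.
        yield {col: merged.get(col) for col in COLUMNS}


def convert(in_path, out_path):
    """Read nested host records (JSONL) and write flattened rows (JSONL)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            for row in unnest_services(json.loads(line)):
                dst.write(json.dumps(row) + "\n")
```

You would call it as convert("hosts.jsonl", "services.jsonl"), where both file names are placeholders, and then pass the output file to generate_dataset.py.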
