arsarabi / llm-device-fingerprints


censys snapshot is in trouble with generate_dataset.py #2

Open Ergofly opened 7 months ago

Ergofly commented 7 months ago

We downloaded .avro files from Censys and tried to convert them to .json,

but the structure of the resulting .json does not seem to match what the parser expects. 'port' and 'softwares' are both subrecords of 'services' in the .json file.

Is there a specific schema that should be used when converting .avro to .json?


arsarabi commented 7 months ago

Hello,

Thanks for your question! For our data, we unnested the services field before exporting the data. Here is the query we used on BigQuery:

SELECT
  host_identifier,
  ipv4_int,
  ipv6_int,
  port,
  transport,
  service_name,
  extended_service_name,
  banner,
  software,
  perspective,
  source_ip,
  truncated,
  observed_at,
  snapshot_date
FROM
  `censys-io.research_1q.universal_internet_dataset`,
  UNNEST(services)
WHERE
  DATE(snapshot_date) = "YYYY-MM-DD";

If your data is in a different format, I believe the best solution is to first convert it to JSONL files with the above format. You can then feed those into generate_dataset.py.
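
For data that still has the nested layout, the unnesting can also be done locally. Below is a minimal sketch in Python, not the script we used. It assumes the converted file has one JSON object per line, that each object holds its services in a list under `services`, and that the key names match the columns in the query above; the function names (`unnest_services`, `convert`) are hypothetical. As with the comma join on `UNNEST(services)` in the query, hosts with no services produce no rows.

```python
import json
import sys
from typing import Any, Dict, Iterator

# Columns selected by the BigQuery query above.
COLUMNS = (
    "host_identifier", "ipv4_int", "ipv6_int", "port", "transport",
    "service_name", "extended_service_name", "banner", "software",
    "perspective", "source_ip", "truncated", "observed_at", "snapshot_date",
)


def unnest_services(host: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per entry in host["services"].

    Assumption: every top-level key except "services" is a host-level
    field, and each service is a dict whose keys are service-level fields.
    """
    base = {k: v for k, v in host.items() if k != "services"}
    for service in host.get("services") or []:
        merged = {**base, **service}
        # Keep only the queried columns; absent ones become null.
        yield {col: merged.get(col) for col in COLUMNS}


def convert(in_path: str, out_path: str) -> None:
    """Read nested JSONL from in_path and write flattened JSONL to out_path."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            for row in unnest_services(json.loads(line)):
                dst.write(json.dumps(row) + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```

If the key names in your converted files differ from the column names in the query, they would need to be renamed before the rows are written out.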