Open Ergofly opened 7 months ago
Hello,
Thanks for your question! For our data, we unnested the services field before exporting the data. Here is the query we used on BigQuery:
SELECT
  host_identifier,
  ipv4_int,
  ipv6_int,
  port,
  transport,
  service_name,
  extended_service_name,
  banner,
  software,
  perspective,
  source_ip,
  truncated,
  observed_at,
  snapshot_date
FROM
  `censys-io.research_1q.universal_internet_dataset`,
  UNNEST(services)
WHERE
  DATE(snapshot_date) = "YYYY-MM-DD";
If your data is in a different format, I believe the best solution is to first convert it to JSONL files with the same flat fields as the query above (one row per service). You can then feed those files into generate_dataset.py.
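For anyone starting from nested records, here is a minimal sketch of that conversion, not the script we actually ran. It assumes the input is already JSONL with one host per line, that each host has a `services` list of sub-records, and that the field names match the SELECT list above. `flatten_host` and `convert` are hypothetical helper names, and reading the .avro files themselves would need a separate Avro library that is not shown here.

```python
import json

# Column names copied from the SELECT list in the query above.
COLUMNS = [
    "host_identifier", "ipv4_int", "ipv6_int", "port", "transport",
    "service_name", "extended_service_name", "banner", "software",
    "perspective", "source_ip", "truncated", "observed_at", "snapshot_date",
]


def flatten_host(record):
    """Yield one flat dict per entry in record["services"].

    Assumption: host-level fields sit at the top level and service-level
    fields sit inside each element of the "services" list.
    """
    host_fields = {k: v for k, v in record.items() if k != "services"}
    for service in record.get("services") or []:
        # Service fields win if a name appears at both levels.
        merged = {**host_fields, **service}
        # Keep only the expected columns; missing ones become None (null).
        yield {col: merged.get(col) for col in COLUMNS}


def convert(nested_path, flat_path):
    """Hypothetical helper: nested JSONL in, flat JSONL out."""
    with open(nested_path) as src, open(flat_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            for row in flatten_host(json.loads(line)):
                dst.write(json.dumps(row) + "\n")
```

This mirrors what UNNEST(services) does in the query: a host with N services becomes N rows, and a host with no services produces no rows. If your nested field names differ from the columns above, they would need to be renamed before this step.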
We downloaded .avro files from Censys and tried to convert them to .json,
but the structure of the resulting .json does not seem to match what the parser expects: 'port' and 'softwares' are both sub-records of 'services' in the .json file.
Is there a specific schema you used for the .avro to .json conversion?