dhl123 / Airtag-2023


UDatasets preprocessed, non-tokenized log files are missing #3

Open tallyAB opened 1 week ago

tallyAB commented 1 week ago

The false-positive filtering step of Airtag used here requires a preprocessed, non-tokenized version of the raw logs, with the fields of each log entry comma-separated, in order to construct a field-wise frequency list for filtering. These files are missing from the UDatasets that we downloaded from the following Google Drive link, so the UDatasets cannot be evaluated. Could you please provide the preprocessed, non-tokenized version of the UDataset logs?
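To make the request concrete, here is a minimal sketch of what a field-wise frequency list over comma-separated log lines could look like. It is not Airtag's actual filtering code: the function name `field_frequencies` and the sample log lines are hypothetical, and the real field layout of the S3 and U datasets may differ.

```python
import csv
import io
from collections import Counter


def field_frequencies(lines):
    """Count how often each value occurs at each comma-separated field position.

    `lines` is any iterable of text lines (an open file works).
    Returns a list of Counters, one per field position.
    """
    counters = []
    for row in csv.reader(lines):
        for i, value in enumerate(row):
            # Grow the list when a row has more fields than seen so far.
            if i >= len(counters):
                counters.append(Counter())
            counters[i][value.strip()] += 1
    return counters


# Hypothetical sample lines, only to show the expected input shape.
sample = io.StringIO(
    "read,/etc/passwd,bash\n"
    "read,/tmp/x,bash\n"
    "write,/tmp/x,vim\n"
)
freqs = field_frequencies(sample)
```

This only works on the non-tokenized, comma-separated form: once the logs are tokenized, the field boundaries are lost, which is why the tokenized UDatasets alone are not enough.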

For reference, an excerpt of the preprocessed, non-tokenized version of logs from the S3 dataset looks like this: image

We have the (airtag-)tokenized version of UDatasets only. An excerpt from the U1 dataset looks like this: image

dhl123 commented 40 minutes ago

Please see the "untokenized_udatasets" folder. I am not sure whether these files are what you want; if not, please let me know.