Hi Jan,
We use Python's built-in csv writer, which should be RFC4180 compliant. We made no changes to the default settings, so it uses the `excel` dialect to create the CSV files.
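For reference, a minimal sketch of what that default dialect does with fields containing the delimiter or quote characters (plain Python for illustration, not the actual rdump code path):

```python
import csv
import io

# The default "excel" dialect quotes any field that contains the
# delimiter, a double quote, or a line break, and doubles embedded
# quotes, as RFC4180 requires.
buf = io.StringIO()
writer = csv.writer(buf)  # dialect="excel" is the default
writer.writerow(["plain", "with,comma", 'with "quote"'])
print(buf.getvalue(), end="")
# plain,"with,comma","with ""quote"""
```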
I made a simple program to create some records, and `rdump` seems to process them exactly as you suggest. Could you check whether you get similar output when you process these records (see the files in test-records.zip)?
Run

```
python3 ./test-records.py | rdump -m csv > test-records.csv
```

or

```
cat test-records.records | rdump -m csv > test-records.csv
```
If you could provide an example that fails, I would also be interested in that.
For `bytes` fields we do use `repr()` before outputting them, which of course is not RFC4180 compliant and will break things if you expect the exact same data back. I'm not sure what RFC4180 says about byte sequences that are not printable and should maybe be properly escaped. Encoding is outside of the RFC4180 spec, so it is up to the user to handle that. We take the `repr()` of `bytes` values and pass the result through Python's csv writer, which then escapes/quotes it correctly, so this in itself is not a violation of RFC4180.
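A rough sketch of that combination, assuming a plain `repr()` call on the raw value (the actual flow.record serialization may differ in detail):

```python
import csv
import io

# Sketch: bytes become their repr() string first, and the csv writer
# then quotes/escapes that string like any other text field.
value = b"\x00raw,bytes\xff"
buf = io.StringIO()
csv.writer(buf).writerow(["field1", repr(value)])
print(buf.getvalue(), end="")
# field1,"b'\x00raw,bytes\xff'"
```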
The round trip record -> csv -> record is not going to work without some effort on the side of the user parsing the CSV file, but that holds for any encoding/escaping chosen for `bytes` fields. The same goes for other field types like `path`, which will lose some information in the round trip.
Your input was very helpful for me to determine that I was partly wrong :-/

I was working with the output of the `wer` function of `target-query`, which prints a dynamic number of CSV columns, depending on the modules loaded by the failed Windows process. The tool I'm working with complains about an inconsistent number of columns, which I had blamed on the command in the failure message of the `wer` entry.
Instead, `rdump` prints a new header every time the columns have changed, like this:
```
field1,field2,field3,_source,_classification,_generated,_version
AB,CD,EF,,,2023-06-07 19:38:02.132301,1
field1,field2,field3,field4,_source,_classification,_generated,_version
PQ,RS,TU,XY,,,2023-06-07 19:38:02.132315,1
```
Then the CSV-formatted record is printed. I was wrong in my assumption that fields containing a comma are not delimited correctly. However, it is not easy to work with CSV files that have multiple header lines and a varying number of columns.
I know that `rdump` does not know which fields the next record might have. One way to handle this could be to store all records (yes, really) in memory to determine all field names that will occur. Another approach could be to read the input a first time, collect all field names, and then read the input a second time (which will not work with streams), printing the formatted data. Both would make `rdump` slower, but could create parsable CSV data. How about providing fast csv (the current approach) and strict csv (the new approach)?
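A minimal sketch of the buffering variant, treating records as plain dicts for illustration (which is not the real flow.record API):

```python
import csv
import sys

def strict_csv(records, out=sys.stdout):
    """Buffer all records, then emit a single header with the union
    of all field names, in first-seen order."""
    buffered = list(records)  # yes, really: keep everything in memory
    fieldnames = []
    for record in buffered:
        for name in record:
            if name not in fieldnames:
                fieldnames.append(name)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for record in buffered:
        writer.writerow(record)  # missing fields are left empty

strict_csv([
    {"field1": "AB", "field2": "CD"},
    {"field1": "PQ", "field2": "RS", "field4": "XY"},
])
```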
Another idea was to require that a function such as `wer` always creates a fixed number of fields. For example, `wer` prints fields which look like arrays, such as `library_file[1]`, `library_file[2]` and so on, which could really be a single field with multiple values. I'm not sure if this is always possible.
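To illustrate the idea, a hypothetical helper that collapses such array-like column names into one multi-valued field (the names and shapes here are made up, not the actual `wer` output format):

```python
import re
from collections import defaultdict

def collapse(record):
    """Collapse keys like library_file[1], library_file[2], ... into a
    single list-valued field named library_file."""
    pattern = re.compile(r"^(?P<base>.+)\[(?P<index>\d+)\]$")
    grouped, flat = defaultdict(dict), {}
    for name, value in record.items():
        match = pattern.match(name)
        if match:
            grouped[match["base"]][int(match["index"])] = value
        else:
            flat[name] = value
    for base, values in grouped.items():
        flat[base] = [values[i] for i in sorted(values)]
    return flat

print(collapse({"pid": 4, "library_file[1]": "a.dll", "library_file[2]": "b.dll"}))
# {'pid': 4, 'library_file': ['a.dll', 'b.dll']}
```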
I gave this problem some thought and created a small proof of concept script that can write records to a SQLite3 database. On each new RecordDescriptor it will try to create the table and add any missing columns. That way you can just write the records once, and when it's finished you will have all columns. You can then read the database and dump it to CSV using tools such as `sqlite-utils`.
Here is the script: https://gist.github.com/yunzheng/944bcd2962e0fcc12ab8037f52e5f6e6
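The core idea, as a rough sketch (the gist handles the real flow.record types; records are plain dicts here for brevity):

```python
import sqlite3

def write_records(db_path, table, records):
    """Create the table from the first record and ALTER TABLE to add
    any column we have not seen before, so a single pass suffices."""
    con = sqlite3.connect(db_path)
    known = set()
    for record in records:
        if not known:
            cols = ", ".join(f'"{c}"' for c in record)
            con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            known.update(record)
        for col in record.keys() - known:
            con.execute(f'ALTER TABLE "{table}" ADD COLUMN "{col}"')
            known.add(col)
        names = list(record)
        collist = ", ".join(f'"{c}"' for c in names)
        placeholders = ", ".join("?" for _ in names)
        con.execute(
            f'INSERT INTO "{table}" ({collist}) VALUES ({placeholders})',
            [record[c] for c in names],
        )
    con.commit()
    con.close()
```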
Example usage (I used your WER patches btw):
```
$ target-query -q -f wer ~/Virtual\ Machines.localized/Win7_32bit.vmwarevm/ | python3 records2sqlite.py wer.db
Writing records to 'wer.db'
Wrote 8 records
```
Then install `sqlite-utils`, show the tables, and dump to CSV:
```
$ pip3 install sqlite-utils
$ sqlite-utils tables wer.db
[{"table": "filesystem/windows/wer/report"}]
$ sqlite-utils rows wer.db filesystem/windows/wer/report --csv > wer.csv
```
I usually use `visidata` for previewing large CSV files; it has a nice TUI :)
```
$ pip3 install visidata
$ vd wer.csv
```
You can even use `visidata` directly on SQLite databases. It's a pretty neat tool.
Please give it a try, and let me know what you think of this solution.
If this works well we can add native sqlite3 support to `rdump` as a `flow.record` adapter.
Hi @janstarke, do you still need help with this issue or can it be closed?
I'm still working on my implementation (in my free time, so very slowly), but your feedback has helped me. This issue can be closed.
Thanks a lot, Jan
Hello,

I found that `rdump` writes CSV files which do not comply with RFC4180. My actual problem is that I have the output from the `wer` plugin of `target-query`, where a field might contain characters that act as a field separator (e.g. a comma). In such cases, RFC4180 states that the field should be enclosed in double quotes.

I'd like to use the output of `rdump` in other tools, where compliance with RFC4180 would be helpful. However, implementing it would possibly break compatibility with previous versions. Do you plan to implement RFC4180, or would it be OK if I did this? Or is compatibility more important to you? Would it be an option to use https://docs.python.org/3/library/csv.html?

Regards, Jan