fox-it / flow.record

Recordization library
GNU Affero General Public License v3.0

`rdump` violates RFC4180 #72

Closed janstarke closed 1 year ago

janstarke commented 1 year ago

Hello,

I found that rdump writes CSV files which do not comply with RFC4180. The specific problem I have is with the output of the wer plugin of target-query, where a field might contain characters that act as a field separator (e.g. a comma). For such cases, RFC4180 states that

   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

I'd like to use the output of rdump in other tools, where compliance with RFC4180 would be helpful. However, implementing it would possibly break compatibility with previous versions. Do you plan to implement RFC4180, or would it be OK if I did so? Or is compatibility more important for you? Would it be an option to use https://docs.python.org/3/library/csv.html?

Regards, Jan

pyrco commented 1 year ago

Hi Jan,

We use Python's built-in csv writer, which should be RFC4180 compliant. We made no changes to the default settings, so it uses the excel dialect to create the CSVs.
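For reference, a minimal standalone snippet showing what that default excel dialect does with embedded commas and line breaks (this is just the stdlib csv module, not rdump's own code):

```python
import csv
import sys

# The default "excel" dialect uses QUOTE_MINIMAL: a field is enclosed in
# double quotes only if it contains the delimiter, the quote character or a
# line break -- which is what RFC4180 rule 6 describes.
writer = csv.writer(sys.stdout)
writer.writerow(["aaa", "b\nbb", "c,cc"])
writer.writerow(["zzz", "yyy", "xxx"])

# Output (with CRLF line endings):
# aaa,"b
# bb","c,cc"
# zzz,yyy,xxx
```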

I made a simple program to create some records, and rdump seems to process them exactly as you suggest. Could you check whether you get similar output when you process these records (see the attached test-records.zip)?

Run

python3 ./test-records.py | rdump -m csv > test-records.csv

or

cat test-records.records | rdump -m csv > test-records.csv

If you could provide an example that fails, I would also be interested in seeing it.
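For context, a sketch of what such a record-generating script could look like; this is not the attached test-records.py, just an approximation using flow.record's RecordDescriptor and RecordWriter (assuming RecordWriter() without a path writes records to stdout):

```python
from flow.record import RecordDescriptor, RecordWriter

# A record type whose fields contain commas and line breaks,
# so the csv output has something to quote.
TestRecord = RecordDescriptor("test/record", [
    ("string", "field1"),
    ("string", "field2"),
    ("string", "field3"),
])

with RecordWriter() as writer:  # no path given: write records to stdout
    writer.write(TestRecord(field1="aaa", field2="b\nbb", field3="c,cc"))
    writer.write(TestRecord(field1="zzz", field2="yyy", field3="xxx"))
```

Piping this through `rdump -m csv` should then produce quoted fields as in the RFC example.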

yunzheng commented 1 year ago

For "bytes" fields we do use repr() before outputting it, which of course is not RFC4180 compliant and will break if you expect the exact same data back.

I'm not sure what RFC4180 says about byte sequences that are not printable and should maybe be properly escaped.

pyrco commented 1 year ago

Encoding is outside the scope of RFC4180, so it is up to the user to handle that. We use Python's csv default, which uses repr() for bytes; the result is subsequently escaped/quoted correctly, so this in itself is not a violation of RFC4180.

The round trip record -> csv -> record is not going to work without some effort on the side of the user parsing the csv file, but that holds for any encoding/escaping chosen for bytes fields. The same goes for other field types such as path, which will lose some information in the round trip.
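To make the bytes behaviour concrete, a small stdlib-only illustration (not flow.record code) of how a bytes value ends up in the csv output and why the round trip needs extra parsing:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# csv.writer stringifies non-string values; for bytes that yields the repr,
# so b'\x00,\x01' becomes the text b'\x00,\x01'. Because that text contains
# a comma, it is quoted -- valid per RFC4180, but reading the original bytes
# back requires the consumer to parse the repr again.
writer.writerow(["ok", b"\x00,\x01"])
print(buf.getvalue(), end="")
# ok,"b'\x00,\x01'"
```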

janstarke commented 1 year ago

Your input was very helpful and made me realize that I was partly wrong :-/

I was working with the output of the wer function of target-query, which prints a dynamic number of CSV columns, depending on the modules loaded by the particular failed Windows process. The tool I'm working with complains about an inconsistent number of columns, which I had blamed on the command in the failure message of the wer entry.

Instead, rdump writes a new header line every time the set of columns changes, like this:

field1,field2,field3,_source,_classification,_generated,_version
AB,CD,EF,,,2023-06-07 19:38:02.132301,1
field1,field2,field3,field4,_source,_classification,_generated,_version
PQ,RS,TU,XY,,,2023-06-07 19:38:02.132315,1

After each header, the csv-formatted record is printed. I was wrong in my assumption that fields containing a comma are not delimited correctly. However, it is not easy to work with CSV files that have multiple header lines and a differing number of columns.

I know that rdump cannot know which fields the next record might have. One way to handle this could be to store all records (yes, really) in memory to determine all field names that will occur. Another approach could be to read the input a first time, collect all field names, and then read the input a second time (which will not work with streams), printing the formatted data. Both would make rdump slower, but they could create parsable CSV data. How about providing fast csv (the current approach) and strict csv (the new approach)?
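A rough sketch of what such a "strict csv" variant could look like (this is not existing rdump behaviour; plain dicts stand in for flow.record records here):

```python
import csv
import sys

def strict_csv(records, out=sys.stdout):
    # First pass: buffer everything and collect every field name that occurs,
    # keeping the order in which fields first appear.
    buffered = list(records)
    fieldnames = []
    for rec in buffered:
        for name in rec:
            if name not in fieldnames:
                fieldnames.append(name)

    # Second pass: one header for all records; missing fields become empty.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(buffered)

strict_csv([
    {"field1": "AB", "field2": "CD", "field3": "EF"},
    {"field1": "PQ", "field2": "RS", "field3": "TU", "field4": "XY"},
])
```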

Another idea would be to require that a function such as wer always creates a fixed set of fields. For example, wer prints fields that look like arrays, such as library_file[1], library_file[2] and so on, which could really be a single field with multiple values. I'm not sure if this is always possible.

yunzheng commented 1 year ago

I gave this problem some thought and created a small proof-of-concept script that can write records to a SQLite3 database.

On each new RecordDescriptor it will try to create the table and add missing columns.

That way you can just write the records once, and when it's finished you will have all columns. You can then read the database and dump it to csv using tools such as sqlite-utils.

Here is the script: https://gist.github.com/yunzheng/944bcd2962e0fcc12ab8037f52e5f6e6
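The core idea, roughly sketched (this is not the gist itself, just an illustration with Python's sqlite3 module and dicts standing in for records):

```python
import sqlite3

def ensure_table(con, table, columns):
    # Create the table on first sight and add any columns we have not seen yet.
    cols = ", ".join(f'"{c}" TEXT' for c in columns)
    con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    existing = {row[1] for row in con.execute(f'PRAGMA table_info("{table}")')}
    for col in columns:
        if col not in existing:
            con.execute(f'ALTER TABLE "{table}" ADD COLUMN "{col}" TEXT')

def write_rows(db_path, table, rows):
    # rows: iterable of dicts; a later row may introduce new columns.
    con = sqlite3.connect(db_path)
    with con:
        for row in rows:
            ensure_table(con, table, list(row))
            names = ", ".join(f'"{c}"' for c in row)
            marks = ", ".join("?" for _ in row)
            con.execute(f'INSERT INTO "{table}" ({names}) VALUES ({marks})',
                        list(row.values()))
    con.close()
```

Because the table ends up with the union of all columns, a single SELECT * (or sqlite-utils, as below) gives one consistent header.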

Example usage (I used your WER patches btw):

$ target-query -q -f wer ~/Virtual\ Machines.localized/Win7_32bit.vmwarevm/ | python3 records2sqlite.py wer.db
Writing records to 'wer.db'
Wrote 8 records

Then install sqlite-utils, show the tables, and dump to CSV:

$ pip3 install sqlite-utils

$ sqlite-utils tables wer.db
[{"table": "filesystem/windows/wer/report"}]

$ sqlite-utils rows wer.db filesystem/windows/wer/report --csv > wer.csv

I usually use visidata for previewing large CSV files, which has a nice TUI :)

$ pip3 install visidata
$ vd wer.csv

You can even use visidata directly on sqlite databases. It's a pretty neat tool.

Please give it a try, and let me know what you think of this solution. If this works well we can add native sqlite3 support to rdump as a flow.record adapter.

Schamper commented 1 year ago

Hi @janstarke, do you still need help with this issue or can it be closed?

janstarke commented 1 year ago

I'm still working on my implementation (in my free time, so very slowly), but your feedback helped me. This issue can be closed.

Thanks a lot, Jan