aboutcode-org / scancode.io

ScanCode.io is a server to script and automate software composition analysis pipelines with ScanPipe pipelines. This project is sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase/ Google Summer of Code, nexB and others generous sponsors!
https://scancodeio.readthedocs.io
Apache License 2.0
119 stars 88 forks source link

XLSX output is truncated #1360

Open chinyeungli opened 3 months ago

chinyeungli commented 3 months ago

Describe the bug The description field from the JSON output is as follow:

  "description": "add and remove users and groups\n This package includes the 'adduser' and 'deluser' commands for creating\n and removing users.\n .\n  - 'adduser' creates new users and groups and adds existing users to\n    existing groups;\n  - 'deluser' removes users and groups and removes users from a given\n    group.\n .\n Adding users with 'adduser' is much easier than adding them manually.\n Adduser will choose appropriate UID and GID values, create a home\n directory, copy skeletal user configuration, and automate setting\n initial values for the user's password, real name and so on.\n .\n Deluser can back up and remove users' home directories\n and mail spool or all the files they own on the system.\n .\n A custom script can be executed after each of the commands.",

However, in the XLSX output, it got trancated in some of the newline character:

add and remove users and groups
 This package includes the 'adduser' and 'deluser' commands for creating
 and removing users.
 .
  - 'adduser' creates new users and groups and adds existing users to
tdruez commented 3 months ago

@chinyeungli The Truncated description is the expected behavior with XLSX outputs.

From https://github.com/nexB/scancode.io/blob/main/scanpipe/pipes/output.py#L384

  • Truncate the "description" field to the first five lines.
if fieldname == "description":
        max_description_lines = 5
        value = "\n".join(value.splitlines(False)[:max_description_lines])

@pombredanne Could you provide some insight on those design decisions?

@chinyeungli As you can see, the XLSX format has some limitations and is not ideal for sharing "scan data". If your main concern is data integrity, the JSON format is preferred.

Convert the value to a string and perform these adaptations:

  • Keep only unique values in lists, preserving ordering.
  • Truncate the "description" field to the first five lines.
  • Truncate any field too long to fit in an XLSX cell and report error.
  • Create a combined license expression for expressions
  • Normalize line endings
  • Truncate the value to a maximum_length supported by XLSX.