fox-it / flow.record

Recordization library
GNU Affero General Public License v3.0
7 stars, 9 forks

Update `elastic.py` adapter #92

Closed 0xbart closed 10 months ago

0xbart commented 10 months ago

Update the elastic.py adapter: fix the AttributeError raised when an invalid URI is given, and add a verify_certs flag to the optional arguments. Also check with hasattr whether 'self.es' is set. It may not be set, for example when an invalid URI (such as one missing the port number) is given.
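A minimal sketch of the guard described above. `Writer` is a hypothetical stand-in, not the real ElasticWriter, and the port check only imitates the validation that the Elasticsearch client performs. The point is that `close()` can run (via `__del__`) on an object whose `__init__` raised before all attributes were assigned.

```python
from urllib.parse import urlsplit


class Writer:
    """Hypothetical stand-in for an adapter whose __init__ can fail early."""

    def __init__(self, uri: str):
        # Imitates client-side validation: fail before any attribute is set
        if urlsplit(uri).port is None:
            raise ValueError("URL must include a 'scheme', 'host', and 'port' component")
        self.queue = []

    def close(self) -> None:
        # Guard: __init__ may have raised before self.queue was assigned
        if hasattr(self, "queue"):
            self.queue.append(StopIteration)

    def __del__(self):
        # Runs even for partially initialised objects
        self.close()
```

Without the `hasattr` check, the failed construction would be followed by a second, unrelated AttributeError from `__del__`, which hides the actual cause.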

DissectBot commented 10 months ago

@0xbart thank you for your contribution! As this is your first code contribution, please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information:

@DissectBot agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
Contributor License Agreement

### Contribution License Agreement

This Contribution License Agreement ("_Agreement_") governs your Contribution(s) (as defined below) and conveys certain license rights to Fox-IT B.V. ("_Fox-IT_") for your Contribution(s) to Fox-IT's open source Dissect project. This Agreement covers any and all Contributions that you ("_You_" or "_Your_"), now or in the future, Submit (as defined below) to this project. This Agreement is between Fox-IT B.V. and You and takes effect when you click an “I Accept” button, check box presented with these terms, otherwise accept these terms or, if earlier, when You Submit a Contribution.

1. **Definitions.** "_Contribution_" means any original work of authorship, including any modifications or additions to an existing work, that is intentionally submitted by You to Fox-IT for inclusion in, or documentation of, any of the software products owned or managed by, or on behalf of, Fox-IT as part of the Project (the "_Work_"). "_Project_" means any of the projects owned or managed by Fox-IT and offered under a license approved by the Open Source Initiative (www.opensource.org). "_Submit_" means any form of electronic, verbal, or written communication sent to Fox-IT or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, Fox-IT for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by You as "_Not a Contribution_."
2. **Grant of Copyright License.** Subject to the terms and conditions of this Agreement, You hereby grant to Fox-IT and to recipients of software distributed by Fox-IT a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.
3. **Grant of Patent License.** Subject to the terms and conditions of this Agreement, You hereby grant to Fox-IT and to recipients of software distributed by Fox-IT a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, maintain, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution(s) alone or by combination of Your Contribution(s) with the Work to which such Contribution(s) was submitted. If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that your Contribution, or the Work to which you have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this Agreement for that Contribution or Work shall terminate as of the date such litigation is filed.
4. **Representations.** You represent that:
   - You are legally entitled to grant the above license.
   - each of Your Contributions is Your original creation (see section 8 for submissions on behalf of others).
   - Your Contribution submissions include complete details of any third-party license or other restriction (including, but not limited to, related patents and trademarks) of which you are personally aware and which are associated with any part of Your Contributions.
5. **Employer.** If Your Contribution is made in the course of Your work for an employer or Your employer has intellectual property rights in Your Submission by contract or applicable law, You must secure permission from Your employer to make the Contribution before signing this Agreement. In that case, the term "_You_" in this Agreement will refer to You and the employer collectively. If You change employers in the future and desire to Submit additional Contribution for the new employer, then You agree to sign a new Agreement and secure permission from the new employer before Submitting those Contributions.
6. **Support.** You are not expected to provide support for Your Contribution, unless You choose to do so. Any such support provided to the Project is provided free of charge.
7. **Warranty.** Unless required by applicable law or agreed to in writing, You provide Your Contributions on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.
8. **Third party material.** Should You wish to submit work that is not Your original creation, You may only submit it to Fox-IT separately from any Contribution, identifying the complete details of its source and of any license or other restriction (including, but not limited to, related patents, trademarks, and license agreements) of which You are personally aware, and conspicuously marking the work as "_Submitted on behalf of a third-party: [named here]_".
9. **Notify.** You agree to notify Fox-IT of any facts or circumstances of which You become aware that would make the above representations inaccurate in any respect.
10. **Governing law / competent court.** This Agreement is governed by the laws of the Netherlands. Any disputes that may arise are resolved by arbitration in accordance with the Arbitration Regulations of the Foundation for the Settlement of Automation Disputes (Stichting Geschillenoplossing Automatisering – SGOA – www.sgoa.eu), this without prejudice to either party's right to request preliminary relief in preliminary relief proceedings or arbitral preliminary relief proceedings. Arbitration proceedings take place in Amsterdam, or in any other place designated in the Arbitration Regulations. Arbitration shall take place in English.

0xbart commented 10 months ago

@DissectBot agree

0xbart commented 10 months ago

I'd also like to introduce another option, called hash_record. The same data can be ingested multiple times, for example because of an HTTP error. One way to prevent duplicates is to make the _id deterministic, for instance by applying hashlib.md5 to the document["_source"] value.

Enabling this option makes ingestion a bit slower, but I think it is useful to introduce. I'm interested to hear other opinions.

Example code:

    # Requires `import hashlib` at module level
    def record_to_document(self, record: Record, index: str) -> dict:
        """Convert a record to an Elasticsearch compatible document dictionary"""
        rdict = record._asdict()

        (...)

        # If hash_record is set, use the MD5 hex digest of the serialized document as its _id
        if self.hash_record:
            document["_id"] = hashlib.md5(
                document["_source"].encode()
                # self.json_packer.pack(record._asdict()).encode()
                # ^ This is also an option if `target-query` is used again
            ).hexdigest()

        return document

One-liner to test: rdump <records> -w 'elastic+https://@127.0.0.1:9200?verify_certs=0&hash_record=1'
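To illustrate how the query-string options in such a URI could map to boolean flags, here is a minimal sketch using only the standard library. `parse_adapter_uri` is a hypothetical helper and is not how flow.record itself parses adapter URIs. It also assumes that only "1" and "true" count as enabled.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_adapter_uri(uri: str):
    """Split 'adapter+scheme://host:port?opts' into a base URL and boolean flags.

    Hypothetical helper for illustration only.
    """
    # Drop the adapter prefix ("elastic+") and parse the remainder as a URL
    _adapter, _, rest = uri.partition("+")
    parts = urlsplit(rest)
    # Assumption: "1" and "true" enable a flag, anything else disables it
    flags = {key: value.lower() in ("1", "true") for key, value in parse_qsl(parts.query)}
    base = f"{parts.scheme}://{parts.hostname}:{parts.port}"
    return base, flags


base, flags = parse_adapter_uri("elastic+https://@127.0.0.1:9200?verify_certs=0&hash_record=1")
```

Here `base` holds the scheme, host and port that a client would connect to, and `flags` maps `verify_certs` to `False` and `hash_record` to `True`.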

yunzheng commented 10 months ago

> I'd also like to introduce another option, called hash_record. The same data can be ingested multiple times, for example because of an HTTP error. One way to prevent duplicates is to make the _id deterministic, for instance by applying hashlib.md5 to the document["_source"] value.
>
> Enabling this option makes ingestion a bit slower, but I think it is useful to introduce. I'm interested to hear other opinions.

It's definitely a nice idea to have a unique (deterministic) identifier for a record, and it could also be useful for other ingestors to avoid duplication. Do note that the _generated timestamp field makes the record non-deterministic (unless you replay it from archived records). If that field is excluded from the hashing, you get a deterministic identifier.

It would be better if this deterministic id generation happened in the core of flow.record, but I'm not sure yet what the best way to achieve this would be.

So for now, I think it's fine to have this feature in the elastic adapter. Feel free to add it to this PR or create a new one.
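A minimal sketch of the idea, assuming the record metadata is a plain dict with a `generated` key and that the document source is serialized as JSON. The `document_id` helper, the `_meta` key and the in-memory `index` dict are all hypothetical and only stand in for the adapter and for Elasticsearch.

```python
import hashlib
import json


def document_id(rdict: dict, rdict_meta: dict) -> str:
    """Hash a record, ignoring the generation timestamp (hypothetical helper)."""
    # Exclude the timestamp so that re-generated records hash identically
    meta = {key: value for key, value in rdict_meta.items() if key != "generated"}
    source = json.dumps({"_meta": meta, **rdict}, sort_keys=True)
    return hashlib.md5(source.encode()).hexdigest()


# Stand-in for an index keyed by _id: a repeated ingest overwrites, not duplicates
index = {}
record = {"hostname": "example", "pid": 42}
for generated in ("2023-01-01T00:00:00", "2023-01-01T00:05:00"):
    meta = {"generated": generated, "source": "demo"}
    index[document_id(record, meta)] = record
```

Both iterations produce the same key, so `index` ends up with a single entry even though the record was generated twice.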

0xbart commented 10 months ago

@yunzheng Thanks for your review! I've updated the PR with the hash_record flag.

I chose the hasattr() check because a second exception is raised after the first one: when __init__ fails, __del__ still calls close() on the partially initialised writer. See the stack trace:

Traceback (most recent call last):
  File "/home/user/.virtualenvs/dissect/src/flow-record/flow/record/tools/rdump.py", line 239, in <module>
    sys.exit(main())
  File "/home/user/.local/share/virtualenvs/dissect/src/flow-record/flow/record/utils.py", line 57, in wrapper
    return func(*args, **kwargs)
  File "/home/user/.virtualenvs/dissect/src/flow-record/flow/record/tools/rdump.py", line 209, in main
    with RecordWriter(uri) as record_writer:
  File "/home/user/.local/share/virtualenvs/dissect/src/flow-record/flow/record/base.py", line 864, in RecordWriter
    return RecordAdapter(url=url, out=True, clobber=clobber, **kwargs)
  File "/home/user/.local/share/virtualenvs/dissect/src/flow-record/flow/record/base.py", line 851, in RecordAdapter
    return cls(cls_url, **arg_dict)
  File "/home/user/.local/share/virtualenvs/dissect/src/flow-record/flow/record/adapter/elastic.py", line 49, in __init__
    self.es = elasticsearch.Elasticsearch(uri, verify_certs=verify_certs, http_compress=http_compress)
  File "/home/user/.local/share/virtualenvs/dissect/lib/python3.10/site-packages/elasticsearch/_sync/client/__init__.py", line 333, in __init__
    node_configs = client_node_configs(
  File "/home/user/.local/share/virtualenvs/dissect/lib/python3.10/site-packages/elasticsearch/_sync/client/utils.py", line 108, in client_node_configs
    node_configs = hosts_to_node_configs(hosts)
  File "/home/user/.local/share/virtualenvs/dissect/lib/python3.10/site-packages/elasticsearch/_sync/client/utils.py", line 146, in hosts_to_node_configs
    return hosts_to_node_configs([hosts])
  File "/home/user/.local/share/virtualenvs/dissect/lib/python3.10/site-packages/elasticsearch/_sync/client/utils.py", line 154, in hosts_to_node_configs
    node_configs.append(url_to_node_config(host))
  File "/home/user/.local/share/virtualenvs/dissect/lib/python3.10/site-packages/elastic_transport/client_utils.py", line 216, in url_to_node_config
    raise ValueError(
ValueError: URL must include a 'scheme', 'host', and 'port' component (ie 'https://localhost:9200')

Exception ignored in: <function AbstractWriter.__del__ at 0x7fd97ad0d750>
Traceback (most recent call last):
  File "/home/user/.local/share/virtualenvs/dissect/src/flow-record/flow/record/adapter/__init__.py", line 39, in __del__
    self.close()
  File "/home/user/.local/share/virtualenvs/dissect/src/flow-record/flow/record/adapter/elastic.py", line 119, in close
    self.queue.put(StopIteration)
AttributeError: 'ElasticWriter' object has no attribute 'queue'

yunzheng commented 10 months ago

There are some linting errors; you can fix them with `tox -e fix`.

0xbart commented 10 months ago

> There are some linting errors; you can fix them with `tox -e fix`.

Done.

yunzheng commented 10 months ago

https://github.com/fox-it/flow.record/blob/5cb258d0f9f963d5a44dfa9f85feaaa0d00c55a5/flow/record/adapter/elastic.py#L72-L76

Can you add the following between lines 75 and 76:

        # remove _generated field from metadata to ensure deterministic documents
        if self.hash_record:
            rdict_meta.pop("generated", None)

This will ensure that the record document is deterministic by removing the _generated field when hash_record is set.

codecov[bot] commented 10 months ago

Codecov Report

Merging #92 (7fb7288) into main (fdcecba) will decrease coverage by 0.43%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
- Coverage   79.36%   78.94%   -0.43%     
==========================================
  Files          32       32              
  Lines        2947     2963      +16     
==========================================
  Hits         2339     2339              
- Misses        608      624      +16

| Flag | Coverage Δ |
|---|---|
| unittests | 78.94% <0.00%> (-0.43%) :arrow_down: |


| Files | Coverage Δ |
|---|---|
| flow/record/adapter/elastic.py | 0.00% <0.00%> (ø) |
