delphi-hub / delphi-crawler

Delphi's crawling and processing engine to extract facts on open-source software
https://delphi.cs.uni-paderborn.de/
Apache License 2.0

Process information from Artifact POM files #47

Closed: johannesduesing closed this 3 years ago

johannesduesing commented 4 years ago

Reason for this PR

According to #15, the Delphi crawler does not yet process any of the artifact information stored in the respective POM file. This means that potentially interesting data fields (including project name, description, etc.) are not accessible when querying Delphi. In addition, the publication date of an artifact is not processed either (see #37).
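To make the kind of data in question concrete, here is a minimal sketch of pulling a few descriptive fields out of a POM file. It is written in Python for brevity and is not the crawler's actual Scala implementation. The function names (`text_of`, `extract_pom_fields`) and the sample POM are made up for illustration; only the element names and the namespace come from the standard Maven POM format.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven POM 4.0.0 documents.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Illustrative POM, assembled by hand from the values shown in this thread.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>xom</groupId>
  <artifactId>xom</artifactId>
  <version>1.2.5</version>
  <name>XOM</name>
  <description>The XOM Dual Streaming/Tree API for Processing XML</description>
  <licenses>
    <license>
      <name>The GNU Lesser General Public License, Version 2.1</name>
      <url>http://www.gnu.org/licenses/lgpl-2.1.html</url>
    </license>
  </licenses>
</project>
"""


def text_of(parent, tag):
    """Return the stripped text of a direct child element, or None if absent."""
    node = parent.find("m:" + tag, POM_NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def extract_pom_fields(pom_xml):
    """Extract a few descriptive fields from a POM document (hypothetical helper)."""
    root = ET.fromstring(pom_xml)
    licenses = [
        {"name": text_of(lic, "name"), "url": text_of(lic, "url")}
        for lic in root.findall("m:licenses/m:license", POM_NS)
    ]
    issue_node = root.find("m:issueManagement", POM_NS)
    return {
        "name": text_of(root, "name"),
        "description": text_of(root, "description"),
        "packaging": text_of(root, "packaging"),  # None if the element is omitted
        "licenses": licenses,
        "issueManagement": None if issue_node is None else text_of(issue_node, "url"),
    }


fields = extract_pom_fields(SAMPLE_POM)
print(fields["name"], "-", fields["description"])
```

Missing elements come back as `None` rather than raising, since many POM files omit optional sections such as `issueManagement`.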

Changes in this PR

Open for discussion

@bhermann, what's your opinion on these questions?

bhermann commented 4 years ago

A partial answer:

johannesduesing commented 4 years ago

I fixed the two points you addressed in the latest commit. Now there is the issue of storing the data. Currently, the ElasticStoreQueries trait supports storing a MavenIdentifier and a HermesResult; however, the publication date and metadata are only available in the MavenArtifact class.

My plan would be to write an additional method that stores a MavenArtifact by extracting its MavenIdentifier and writing the publication date and metadata, if available, to the database (similar to what is being done for HermesResult). I would then attach this as a sink to the "Processing" stage using the .alsoTo operator, similar to the current implementation for storing MavenIdentifiers.

Do you agree with that plan? If so, do you want me to implement the whole thing, or keep it as a skeleton implementation until we have discussed the Elasticsearch data model changes in depth?
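The storage plan described above could look roughly like the following. This is a language-neutral sketch in Python, not the proposed Scala code: `MavenIdentifier` and `MavenArtifact` are named after the classes mentioned in this comment, but their fields and the helper `artifact_to_document` are assumptions made for illustration. It only covers building the document with optional fields; the Akka Streams wiring via `.alsoTo` is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MavenIdentifier:
    # Field names are assumptions; the real class lives in the crawler's Scala code.
    group_id: str
    artifact_id: str
    version: str


@dataclass(frozen=True)
class MavenArtifact:
    identifier: MavenIdentifier
    published: Optional[datetime] = None  # publication date, if it could be determined
    metadata: Optional[dict] = None       # fields extracted from the POM, if any


def artifact_to_document(artifact):
    """Build the document to store; optional parts are written only if available."""
    ident = artifact.identifier
    doc = {
        "identifier": {
            "groupId": ident.group_id,
            "artifactId": ident.artifact_id,
            "version": ident.version,
        }
    }
    if artifact.published is not None:
        doc["published"] = artifact.published.isoformat()
    if artifact.metadata is not None:
        doc["pom"] = artifact.metadata
    return doc


xom = MavenArtifact(
    identifier=MavenIdentifier("xom", "xom", "1.2.5"),
    published=datetime(2010, 5, 12, 6, 22, 10, tzinfo=timezone.utc),
    metadata={"name": "XOM", "packaging": "jar"},
)
print(artifact_to_document(xom))
```

Leaving absent fields out of the document, instead of writing placeholder values, keeps "not available" distinguishable from an actual value when querying later.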

johannesduesing commented 4 years ago

Here's the latest update to this PR:

I tested the application on my machine using a fresh Elasticsearch instance (version 5.6.9), and POM file processing seems to work fine. For me, the only thing left to discuss is a suitable data model for storing the data. Using the current implementation, a search query to Elasticsearch yields the following result:

[...]
          "identifier" : {
            "groupId" : "xom",
            "artifactId" : "xom",
            "version" : "1.2.5"
          },
          "discovered" : "2020-09-21T15:10:34.824+02:00",
          "published" : "2010-05-12T06:22:10.000Z",
          "pom" : {
            "parent" : "None",
            "licenses" : [
              {
                "name" : "The GNU Lesser General Public License, Version 2.1",
                "url" : "http://www.gnu.org/licenses/lgpl-2.1.html"
              }
            ],
            "issueManagement" : "None",
            "developers" : "elharo",
            "name" : "XOM",
            "description" : "The XOM Dual Streaming/Tree API for Processing XML",
            "packaging" : "jar",
            "dependencies" : [
              {
                "groupId" : "xml-apis",
                "scope" : "default",
                "artifactId" : "xml-apis",
                "version" : "1.3.03"
              },
              {
                "groupId" : "xerces",
                "scope" : "default",
                "artifactId" : "xercesImpl",
                "version" : "2.8.0"
              },
              {
                "groupId" : "xalan",
                "scope" : "default",
                "artifactId" : "xalan",
                "version" : "2.7.0"
              }
            ]
          }
        }

I am unsure whether this is the correct way to deal with lists (for dependencies and licenses) in Elasticsearch. @bhermann, what is your opinion on that?
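One point that may help with this question: with the default `object` mapping, Elasticsearch flattens an array of objects into one multi-valued field per key, so the link between, say, a dependency's `groupId` and its `version` is not preserved for queries. The `nested` datatype keeps each array element as its own unit. The sketch below simulates that difference in plain Python; it does not talk to Elasticsearch, the helper names are made up, and the mapping fragment at the end is only one possible option, not a decided data model.

```python
def flatten_object_array(field, items):
    """Simulate the default 'object' mapping: one multi-valued field per key."""
    flat = {}
    for item in items:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def flat_match(flat, field, criteria):
    """Each criterion is checked against its flattened field independently."""
    return all(
        value in flat.get(f"{field}.{key}", [])
        for key, value in criteria.items()
    )


def nested_match(items, criteria):
    """All criteria must hold within the same array element."""
    return any(
        all(item.get(key) == value for key, value in criteria.items())
        for item in items
    )


dependencies = [
    {"groupId": "xml-apis", "artifactId": "xml-apis", "version": "1.3.03"},
    {"groupId": "xerces", "artifactId": "xercesImpl", "version": "2.8.0"},
]

flat = flatten_object_array("pom.dependencies", dependencies)

# A combination that no single dependency has: xml-apis in version 2.8.0.
cross = {"groupId": "xml-apis", "version": "2.8.0"}
print(flat_match(flat, "pom.dependencies", cross))  # True (false positive)
print(nested_match(dependencies, cross))            # False

# Possible mapping fragment for the dependencies field (an assumption).
NESTED_MAPPING_FRAGMENT = {
    "dependencies": {
        "type": "nested",
        "properties": {
            "groupId": {"type": "keyword"},
            "artifactId": {"type": "keyword"},
            "version": {"type": "keyword"},
            "scope": {"type": "keyword"},
        },
    }
}
```

If queries only ever ask "does this artifact depend on group X at all", the flattened default is sufficient. The nested variant matters once queries combine several fields of the same dependency or license, at the cost of requiring nested queries.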

sonarcloud[bot] commented 4 years ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A); Security Hotspots to review: 0
Code Smells: 0 (rating A)

Coverage: no information
Duplication: 0.0%

johannesduesing commented 3 years ago

Closed, as this functionality is now part of the redesign proposed in #50.