Process information from Artifact POM files

johannesduesing commented 4 years ago

Reason for this PR

According to #15, the Delphi crawler does not yet process any of the artifact information stored in the respective POM file. This means that potentially interesting data fields (including project name, description, etc.) are not accessible when querying Delphi. In addition, the publication date of an artifact is not processed either (see #37).

Changes in this PR

Extended the MavenArtifact class with two optional attributes: publicationDate, and metadata of type ArtifactMetadata
Introduced the new type ArtifactMetadata, which holds information parsed from POM files: currently the name, the description, and the system name and URL of the issueManagement section
The publication date of an artifact is extracted from the HTTP header in MavenDownloadActor and set accordingly
Introduced PomFileReadActor, which reads the POM file for a given MavenArtifact and sets the ArtifactMetadata accordingly. It is currently triggered in the MavenDiscoveryProcess as part of preprocessing. Uses Apache Maven's MavenXpp3Reader for POM file processing.
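
The data model changes and the publication date extraction listed above might look roughly like the following sketch. The field names of ArtifactMetadata, the IssueManagementData helper, the shape of MavenIdentifier, and the assumption that the date arrives in RFC 1123 format (as in a Last-Modified header) are illustrative only; the description above just says that the date comes from an HTTP header.

```scala
import java.time.{Instant, ZonedDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object ArtifactModelSketch {
  // Hypothetical shapes; the real classes in the crawler may differ.
  final case class MavenIdentifier(groupId: String, artifactId: String, version: String)

  final case class IssueManagementData(system: String, url: String)

  final case class ArtifactMetadata(
    name: String,
    description: String,
    issueManagement: Option[IssueManagementData]
  )

  // Both new attributes are optional, so existing call sites keep working.
  final case class MavenArtifact(
    identifier: MavenIdentifier,
    publicationDate: Option[Instant] = None,
    metadata: Option[ArtifactMetadata] = None
  )

  // Assumption: the header value uses the RFC 1123 date format.
  // Returns None if the value cannot be parsed.
  def parsePublicationDate(headerValue: String): Option[Instant] =
    Try(ZonedDateTime.parse(headerValue, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant).toOption
}
```

Keeping both attributes as Option values means an artifact without a readable POM or without a usable date header can still be represented.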

Open for discussion

What other attributes shall be parsed from the POM file?
Is it sensible to have POM file processing as part of the 'preprocessing', or does it belong in the 'processing' phase?
Currently, when POM processing fails, the artifact is removed from the list of artifacts to process, i.e. it will not be passed to Hermes. What is the desired behavior when POM processing fails?

@bhermann, what's your opinion on these questions?

bhermann commented 4 years ago

A partial answer:

I would rather see them in the processing package than in the preprocessing package.
When POM processing fails it should not affect processing of the Java package.
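
One way to read the second point is that a failed POM read simply leaves the metadata empty while the artifact continues through the pipeline. A minimal sketch of that idea follows; Artifact, Metadata, readPom and enrich are hypothetical names for illustration, not the crawler's actual API.

```scala
import scala.util.{Failure, Success, Try}

object PomFailureSketch {
  // Hypothetical, simplified stand-ins for the crawler's types.
  final case class Metadata(name: String)
  final case class Artifact(id: String, metadata: Option[Metadata] = None)

  // Stand-in for POM parsing; fails for artifacts without a usable POM.
  def readPom(artifact: Artifact): Try[Metadata] =
    if (artifact.id.endsWith(":broken")) Failure(new IllegalStateException("POM not readable"))
    else Success(Metadata(name = artifact.id))

  // A failure leaves the metadata empty instead of dropping the artifact.
  def enrich(artifact: Artifact): Artifact =
    artifact.copy(metadata = readPom(artifact).toOption)
}
```

Mapping enrich over a list of artifacts returns a list of the same length, so downstream processing of the Java package is unaffected by POM errors.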

johannesduesing commented 4 years ago

I fixed the two points you addressed in the latest commit. Now there is the issue of storing the data. Currently the ElasticStoreQueries trait supports storing a MavenIdentifier and a HermesResult, but the publication date and metadata are only available in the MavenArtifact class.

My plan would be to write an additional method that stores a MavenArtifact by extracting its MavenIdentifier and writing the publication date and metadata, if available, to the database (similar to what is being done for HermesResult). I would then attach this as a sink to the "Processing" stage using the .alsoTo operator, similar to the current implementation for storing MavenIdentifiers.

Do you agree with that plan? And if so, do you want me to implement the whole thing, or make it a skeleton implementation until we have discussed the Elasticsearch data model changes in depth?

johannesduesing commented 4 years ago

Here's the latest update to this PR:

POM file processing now extracts the parent (optional) and packaging
POM file processing now extracts dependencies. If variables are used (e.g. ${foo.version}), the implementation attempts to resolve them. Resolution starts in the local POM, but downloads and processes parent POMs if required and available. The same goes for dependencies without a version: the implementation recurses through all parents to find the matching version definition. The scope of each dependency is extracted as well.
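
The variable resolution described above can be sketched as follows. This is a simplified illustration, not the code in this PR: the Pom type is hypothetical and holds only properties and an optional, already loaded parent, whereas the actual implementation downloads parent POMs on demand. Built-in Maven properties and variables nested inside property values are not handled here.

```scala
import scala.annotation.tailrec

object PomVariableSketch {
  // Hypothetical, simplified POM: only properties and an optional parent.
  final case class Pom(properties: Map[String, String], parent: Option[Pom] = None)

  // Matches placeholders of the form ${some.key}.
  private val Variable = """\$\{([^}]+)\}""".r

  // Looks a property up in the local POM first, then walks up the parents.
  @tailrec
  def lookup(key: String, pom: Pom): Option[String] =
    pom.properties.get(key) match {
      case found @ Some(_) => found
      case None => pom.parent match {
        case Some(p) => lookup(key, p)
        case None => None
      }
    }

  // Replaces every placeholder whose key can be resolved; others stay as-is.
  def resolve(value: String, pom: Pom): String =
    Variable.replaceAllIn(value, m =>
      java.util.regex.Matcher.quoteReplacement(lookup(m.group(1), pom).getOrElse(m.matched)))
}
```

Leaving unresolved placeholders untouched makes it visible in the stored data that a version could not be determined, rather than silently storing an empty value.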

I tested the application on my machine using a fresh Elasticsearch instance (version 5.6.9), and POM file processing seems to work fine. For me, the only thing left to discuss is a suitable data model for storing the data. With the current implementation, a search query to Elasticsearch yields the following result:

[...]
          "identifier" : {
            "groupId" : "xom",
            "artifactId" : "xom",
            "version" : "1.2.5"
          },
          "discovered" : "2020-09-21T15:10:34.824+02:00",
          "published" : "2010-05-12T06:22:10.000Z",
          "pom" : {
            "parent" : "None",
            "licenses" : [
              {
                "name" : "The GNU Lesser General Public License, Version 2.1",
                "url" : "http://www.gnu.org/licenses/lgpl-2.1.html"
              }
            ],
            "issueManagement" : "None",
            "developers" : "elharo",
            "name" : "XOM",
            "description" : "The XOM Dual Streaming/Tree API for Processing XML",
            "packaging" : "jar",
            "dependencies" : [
              {
                "groupId" : "xml-apis",
                "scope" : "default",
                "artifactId" : "xml-apis",
                "version" : "1.3.03"
              },
              {
                "groupId" : "xerces",
                "scope" : "default",
                "artifactId" : "xercesImpl",
                "version" : "2.8.0"
              },
              {
                "groupId" : "xalan",
                "scope" : "default",
                "artifactId" : "xalan",
                "version" : "2.7.0"
              }
            ]
          }
        }

I am unsure whether this is the correct way to deal with lists (for dependencies and licenses) in Elasticsearch. @bhermann, what is your opinion on that?
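
For context on the list question: with Elasticsearch's default object mapping, an array of objects is indexed as one multi-valued field per key, so the association between the fields of a single entry is lost unless the field is mapped as nested. The sketch below only mimics that effect in plain Scala to show the consequence for queries; it does not call Elasticsearch, and all names in it are made up for illustration.

```scala
object FlattenSketch {
  final case class Dependency(groupId: String, artifactId: String, version: String)

  // Mimics how an array of objects ends up as one multi-valued field per key.
  def flatten(deps: Seq[Dependency]): Map[String, Seq[String]] = Map(
    "dependencies.groupId" -> deps.map(_.groupId),
    "dependencies.artifactId" -> deps.map(_.artifactId),
    "dependencies.version" -> deps.map(_.version)
  )

  // On the flattened form, a query can only ask whether each value occurs somewhere.
  def flatMatch(flat: Map[String, Seq[String]], artifactId: String, version: String): Boolean =
    flat("dependencies.artifactId").contains(artifactId) &&
      flat("dependencies.version").contains(version)

  // A per-entry match, which is what a nested mapping would preserve.
  def nestedMatch(deps: Seq[Dependency], artifactId: String, version: String): Boolean =
    deps.exists(d => d.artifactId == artifactId && d.version == version)
}
```

With the three dependencies from the result above, a flattened search for xalan in version 2.8.0 would match even though that version belongs to xercesImpl, while the per-entry match would not.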

sonarcloud[bot] commented 4 years ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
0 Code Smells

No Coverage information
0.0% Duplication

johannesduesing commented 3 years ago

Closed, as this functionality is now part of the redesign proposed in #50.

delphi-hub / delphi-crawler

Process information from Artifact POM files #47