deegree / deegree3

Official deegree repository providing geospatial core libraries, data access and advanced OGC web service implementations
https://www.deegree.org
GNU Lesser General Public License v2.1
146 stars 99 forks source link

Improve performance of the deegree GmlLoader #1570

Open julianzz98 opened 1 year ago

julianzz98 commented 1 year ago

While using the GmlLoader of the deegree GML tools CLI, it came to my attention that the GmlLoader gets slower the more features it is processing.

The used dataset:

Description of the dataset (german-only):


Command using the deegree SqlFeatureStoreConfigCreator:

java -jar deegree-tools-gml.jar SqlFeatureStoreConfigCreator -format=all -idtype=uuid -mapping=relational -dialect=postgis -schemaUrl=https://inspire.ec.europa.eu/schemas/so/4.0/Soil.xsd

Command using the deegree GmlLoader:

java -Xmx16g -jar deegree-tools-gml.jar GmlLoader -pathToFile=/Soil.gml -workspaceName=inspire_soil -sqlFeatureStoreId=inspire_soil_fs -disabledResources=https://inspire.ec.europa.eu/codelist/

End of the GmlLoader logs (after it successfully ran):

[...]
2023-08-30 03:42:57 [main] WARN org.deegree.feature.Features - Unable to resolve external reference '#SoilLayer_1300681451000100_30_0'. Ignoring.
2023-08-30 03:42:57 [main] WARN org.deegree.feature.Features - Unable to resolve external reference '#SoilLayer_1300681451000100_30_0'. Ignoring.
2023-08-30 03:42:57 [main] WARN org.deegree.feature.Features - Unable to resolve external reference '#OM_Observation_1300681451000100_HUMUS_DM'. Ignoring.
2023-08-30 03:42:57 [main] WARN org.deegree.feature.Features - Unable to resolve external reference '#OM_Observation_1300681451000100_HUMUS_DM'. Ignoring.
2023-08-30 03:42:57 [main] INFO o.d.t.f.loader.FeatureStoreWriter - Insert performed.
2023-08-30 03:42:57 [main] INFO o.d.t.f.loader.FeatureStoreWriter - Insert performed.
2023-08-30 03:42:57 [main] INFO o.d.t.f.loader.FeatureStoreWriter - Number of features processed: 1060212
2023-08-30 03:42:57 [main] INFO o.d.t.f.loader.FeatureStoreWriter - Number of features processed: 1060212
2023-08-30 08:14:19 [main] INFO o.d.t.f.loader.TransactionHandler - Commit transaction.
2023-08-30 08:14:19 [main] INFO o.d.t.f.loader.TransactionHandler - Commit transaction.
2023-08-30 08:14:22 [main] INFO o.s.batch.core.step.AbstractStep - Step: [gmlLoaderStep] executed in 21h28m14s832ms
2023-08-30 08:14:22 [main] INFO o.s.batch.core.step.AbstractStep - Step: [gmlLoaderStep] executed in 21h28m14s832ms
2023-08-30 08:14:22 [main] INFO o.s.batch.core.step.AbstractStep - Step: [gmlLoaderStep] executed in 21h28m14s832ms
2023-08-30 08:14:22 [main] INFO o.d.t.f.loader.GmlReader - Finished file Soil.gml
2023-08-30 08:14:22 [main] INFO o.d.t.f.loader.GmlReader - Finished file Soil.gml
2023-08-30 08:14:22 [main] INFO o.d.t.f.loader.GmlReader - Finished file Soil.gml
2023-08-30 08:14:22 [main] INFO o.d.t.f.loader.GmlReader - Finished file Soil.gml
2023-08-30 08:14:24 [main] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=gmlLoaderJob]] completed with the following parameters: [{pathToFile=/Soil.gml, workspaceName=inspire_soil_final, run.id=1, disabledResources=https://inspire.ec.europa.eu/codelist/, sqlFeatureStoreId=inspire_soil_fs}] and the following status: [COMPLETED] in 21h28m14s910ms
2023-08-30 08:14:24 [main] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=gmlLoaderJob]] completed with the following parameters: [{pathToFile=/Soil.gml, workspaceName=inspire_soil_final, run.id=1, disabledResources=https://inspire.ec.europa.eu/codelist/, sqlFeatureStoreId=inspire_soil_fs}] and the following status: [COMPLETED] in 21h28m14s910ms
2023-08-30 08:14:24 [main] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=gmlLoaderJob]] completed with the following parameters: [{pathToFile=/Soil.gml, workspaceName=inspire_soil_final, run.id=1, disabledResources=https://inspire.ec.europa.eu/codelist/, sqlFeatureStoreId=inspire_soil_fs}] and the following status: [COMPLETED] in 21h28m14s910ms

Observation: The final import of the data takes around ~4,5 hours, as seen in the logs.

stephanr commented 1 year ago

Can you provide more details for that operation that took so long for that particular 850 MB dataset? For example used type of dataset/scheme (if possible/public), used database, was the dataset compressed and options of the loader command line would help to describe the issue better if it may is related to the data.

In the past I had similar issues and created a (partial) workaround (in #1454) for issue #1451, did you use the flag -skipReferenceCheck=true in your operation loading these 850 MB? If this was not used in your process it would be interesting to know if this option would make a notable difference in time.

This may also be related to:

julianzz98 commented 1 year ago

Thanks for providing the existing references on the topic. I was planning on providing more details and fill the issue later today. Will comment here again once I am done!

julianzz98 commented 1 year ago

@stephanr I updated the description of the issue with the needed information of usage as well as the dataset. If I repeat the operation, I will make sure to try out the flag -skipReferenceCheck=true!