AtlasOfLivingAustralia / data-management

Data management issue tracking

Data load : SA Museum #569

Open peggynewman opened 4 years ago

peggynewman commented 4 years ago

As per https://support.ehelp.edu.au/a/tickets/74060

peggynewman commented 4 years ago

Before load: SOLR: ?

[DataLoader] - There are 60645 records in the file. The number of NEW records: 1904
[DataLoader] - Load finished for ornithology-SAM-DWC-CSV-Export-2002020729.csv

After process and indexing: 60,641 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Acoco127&fq=last_load_date%3A%5B2020-03-26T00%3A00%3A00Z+TO+*%5D

Records not processed from upload file:


institutionCode | collectionCode | catalogNumber
SAMA | Ornithology | B4293
SAMA | Ornithology | B47692
SAMA | Ornithology | B54896
SAMA | Ornithology | B4171
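
The shortfall between the file and the index (60,645 - 60,641 = 4) matches the four records listed above. A minimal sketch of how such a list could be derived, assuming (hypothetically) that both the upload file and an export of the indexed records are available as CSV text with a `catalogNumber` column. The function name and the sample rows are illustrative; only `B4293` and `B4171` come from the table above, and `B1` is made up.

```python
import csv
import io

def unprocessed_catalog_numbers(upload_csv, indexed_csv):
    """Return catalogNumber values present in the upload but absent from the index."""
    uploaded = [row["catalogNumber"] for row in csv.DictReader(io.StringIO(upload_csv))]
    indexed = {row["catalogNumber"] for row in csv.DictReader(io.StringIO(indexed_csv))}
    # Keep upload order so the result is easy to check against the source file.
    return [number for number in uploaded if number not in indexed]

# Illustrative data: B1 is a hypothetical record that was indexed successfully.
upload = (
    "institutionCode,collectionCode,catalogNumber\n"
    "SAMA,Ornithology,B4293\n"
    "SAMA,Ornithology,B1\n"
    "SAMA,Ornithology,B4171\n"
)
indexed = "catalogNumber\nB1\n"

missing = unprocessed_catalog_numbers(upload, indexed)
print(missing)
```
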
patkyn commented 4 years ago

Before load there were 16,902 records in solr from South Australian Museum Marine Invertebrates Collection: https://collections.ala.org.au/public/show/co165

[DataLoader] - There are 54707 records in the file. The number of NEW records: 38029
[DataLoader] - Load finished for marine-invertebrates-SAM-DWC-CSV-Export-2002020622.csv

An error occurred during processing:

aws-bstore-3b 2020-03-30 14:17:28,091 INFO : [ProcessRecords] - 5000 >> Last key : 79170ffe-cc69-451a-b152-2538fa497481, records per sec: 993.0487
aws-bstore-3b 2020-03-30 14:17:28,170 WARN : [RecordProcessor] - Non-fatal error processing record: b93b5035-e921-40a0-9e6a-f84af4e5b5ec, processorName: loc, error: null
java.lang.NullPointerException
    at au.org.ala.biocache.processor.LocationProcessor.checkCoordinateUncertainty(LocationProcessor.scala:596)
    at au.org.ala.biocache.processor.LocationProcessor.process(LocationProcessor.scala:53)
    at au.org.ala.biocache.processor.RecordProcessor$$anonfun$processRecord$1.apply(RecordProcessor.scala:93)
    at au.org.ala.biocache.processor.RecordProcessor$$anonfun$processRecord$1.apply(RecordProcessor.scala:87)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
    at au.org.ala.biocache.processor.Processors$.foreach(Processors.scala:12)
    at au.org.ala.biocache.processor.RecordProcessor.processRecord(RecordProcessor.scala:87)
    at au.org.ala.biocache.tool.ProcessRecords$$anonfun$5$$anonfun$6.apply(ProcessRecords.scala:122)
    at au.org.ala.biocache.tool.ProcessRecords$$anonfun$5$$anonfun$6.apply(ProcessRecords.scala:116)
    at au.org.ala.biocache.util.StringConsumer.run(StringConsumer.scala:32)
aws-bstore-3b 2020-03-30 14:17:28,926 INFO : [ProcessRecords] - 6000 >> Last key : 304fad5d-a1a7-4031-b05d-a4f3b1f41a29, records per sec: 1197.6049

After process and indexing: 54,702 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Aco165&fq=last_load_date%3A%5B2020-03-30T00%3A00%3A00Z+TO+*%5D

https://biocache.ala.org.au/occurrences/b93b5035-e921-40a0-9e6a-f84af4e5b5ec shows that the lat/lng is empty. The upload file marine-invertebrates-SAM-DWC-CSV-Export-2002020622.csv also has an empty lat/lng for this record.
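
Since the NullPointerException in `LocationProcessor.checkCoordinateUncertainty` coincides with a record whose lat/lng is empty, one option is to flag such rows in the upload file before loading. The following is only a sketch: it assumes the export uses the Darwin Core column names `decimalLatitude` and `decimalLongitude` (the actual headers of the SAM export are not shown in this issue), and the sample rows and catalogue numbers are invented. It does not establish that empty coordinates are the cause of the exception.

```python
import csv
import io

def rows_missing_coordinates(csv_text, lat_col="decimalLatitude", lon_col="decimalLongitude"):
    """Return the catalogNumber of every row whose latitude or longitude is blank."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = (row.get(lat_col) or "").strip()
        lon = (row.get(lon_col) or "").strip()
        if not lat or not lon:
            flagged.append(row.get("catalogNumber", ""))
    return flagged

# Hypothetical sample: X2 has no coordinates, X3 has a latitude only.
sample = (
    "catalogNumber,decimalLatitude,decimalLongitude\n"
    "X1,-34.92,138.60\n"
    "X2,,\n"
    "X3,-35.0,\n"
)

flagged = rows_missing_coordinates(sample)
print(flagged)
```
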

Records not processed from upload file

institutionCode | collectionCode | catalogNumber
SAMA | Marine Invertebrates | E4339
SAMA | Marine Invertebrates | D65420
SAMA | Marine Invertebrates | S2100

patkyn commented 4 years ago

Before load there were 33,965 records in solr from South Australian Museum Mammalogy Collection https://collections.ala.org.au/public/show/co126

[DataLoader] - There are 30991 records in the file. The number of NEW records: 333
[DataLoader] - Load finished for mammalogy-SAM-DWC-CSV-Export-2001300617.csv

After process and indexing: 30,990 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Aco126&fq=last_load_date%3A%5B2020-03-30T00%3A00%3A00Z+TO+*%5D
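
The verification queries in this issue all follow the same pattern: a `collection_uid` query plus a `last_load_date` range filter starting at the load date. A small sketch of building such a URL with the standard library; the helper name `loaded_since_url` is made up for illustration, and the base URL and parameter names are copied from the query above.

```python
from urllib.parse import urlencode

BIOCACHE_SEARCH = "https://biocache.ala.org.au/occurrences/search"

def loaded_since_url(collection_uid, since_iso):
    """Build a search URL for records of a collection loaded on or after since_iso."""
    params = {
        "q": f"collection_uid:{collection_uid}",
        "fq": f"last_load_date:[{since_iso} TO *]",
    }
    # safe="*" leaves the open-ended range wildcard unescaped, as in the URLs in this issue.
    return BIOCACHE_SEARCH + "?" + urlencode(params, safe="*")

url = loaded_since_url("co126", "2020-03-30T00:00:00Z")
print(url)
```
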

patkyn commented 4 years ago

Before load there were 17,379 records from South Australian Museum Ichthyology Collection https://collections.ala.org.au/public/show/co57

[DataLoader] - There are 16491 records in the file. The number of NEW records: 276
[DataLoader] - Load finished for ichthyology-SAM-DWC-CSV-Export-2001290609.csv

After process and indexing: 16,490 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Aco57&fq=last_load_date%3A%5B2020-03-30T00%3A00%3A00Z+TO+*%5D
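
The file and new-record counts quoted in each comment can be pulled out of the `[DataLoader]` log line mechanically rather than copied by hand. A sketch, assuming the message wording stays exactly as pasted in this issue; the regular expression and the function name are illustrative and not part of biocache.

```python
import re

# Matches the DataLoader summary message as it appears in the logs above.
LOADER_RE = re.compile(
    r"There are (?P<total>\d+) records in the file\. "
    r"The number of NEW records: (?P<new>\d+)"
)

def parse_loader_line(line):
    """Return (records_in_file, new_records), or None if the line does not match."""
    match = LOADER_RE.search(line)
    if match is None:
        return None
    return int(match.group("total")), int(match.group("new"))

line = "[DataLoader] - There are 16491 records in the file. The number of NEW records: 276"
counts = parse_loader_line(line)
print(counts)
```
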

patkyn commented 4 years ago

Before load there were 76,267 records in Solr from South Australian Museum Herpetology Collection: https://collections.ala.org.au/public/show/co125

[DataLoader] - There are 76669 records in the file. The number of NEW records: 1080
[DataLoader] - Load finished for herpetology-SAM-DWC-CSV-Export-2002020651.csv

Error during sampling:

aws-bstore-1b 2020-03-30 22:11:39,377 INFO : [PointsReader] - Loading points from file: /data/tmp/loc-all.txt
aws-bstore-1b 2020-03-30 22:11:39,394 ERROR: [PointsReader] - Error reading point: ,
java.lang.NumberFormatException: empty String
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
    at au.org.ala.biocache.tool.PointsReader$$anonfun$loadPoints$1.apply(Sampling.scala:275)
    at au.org.ala.biocache.tool.PointsReader$$anonfun$loadPoints$1.apply(Sampling.scala:275)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at au.org.ala.biocache.tool.PointsReader.loadPoints(Sampling.scala:275)
    at au.org.ala.biocache.tool.Sampling.sampling(Sampling.scala:476)
    at au.org.ala.biocache.tool.Sampling$.main(Sampling.scala:136)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:130)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)
aws-bstore-1b 2020-03-30 22:12:12,951 INFO : [Sampling] - Total points sampled so far : 4844
aws-bstore-1b 2020-03-30 22:12:12,952 INFO : [PointsReader] - Loading points from file: /data/tmp/loc-all.txt
aws-bstore-1b 2020-03-30 22:12:12,952 INFO : [LoadSamplingConsumer] - Loading the sampling into the database: /data/tmp/sampling-all.txt
aws-bstore-1b 2020-03-30 22:12:12,969 INFO : [Sampling] - Total points sampled : 4844, output file: /data/tmp/sampling-all.txt point file: /data/tmp/loc-all.txt
aws-bstore-1b 2020-03-30 22:12:12,969 INFO : [Sampling] - ********* END - TEST BATCH SAMPLING FROM FILE ***************
aws-bstore-1b 2020-03-30 22:12:13,925 INFO : [LoadSamplingConsumer] - writing to loc:1000: records per sec: 1027.7493
aws-bstore-1b 2020-03-30 22:12:14,907 INFO : [LoadSamplingConsumer] - writing to loc:2000: records per sec: 1018.32996
aws-bstore-1b 2020-03-30 22:12:15,891 INFO : [LoadSamplingConsumer] - writing to loc:3000: records per sec: 1016.26013
aws-bstore-1b 2020-03-30 22:12:16,888 INFO : [LoadSamplingConsumer] - writing to loc:4000: records per sec: 1003.00903
aws-bstore-1b 2020-03-30 22:12:17,667 INFO : [LoadSamplingConsumer] - Finished loading: /data/tmp/sampling-all.txt in 4715ms
aws-bstore-3b 2020-03-30 22:11:37,649 INFO : [PointsReader] - Loading points from file: /data/tmp/loc-all.txt
aws-bstore-3b 2020-03-30 22:11:37,666 ERROR: [PointsReader] - Error reading point: ,
java.lang.NumberFormatException: empty String
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
    at au.org.ala.biocache.tool.PointsReader$$anonfun$loadPoints$1.apply(Sampling.scala:275)
    at au.org.ala.biocache.tool.PointsReader$$anonfun$loadPoints$1.apply(Sampling.scala:275)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at au.org.ala.biocache.tool.PointsReader.loadPoints(Sampling.scala:275)
    at au.org.ala.biocache.tool.Sampling.sampling(Sampling.scala:476)
    at au.org.ala.biocache.tool.Sampling$.main(Sampling.scala:136)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:130)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)
aws-bstore-3b 2020-03-30 22:12:11,666 INFO : [Sampling] - Total points sampled so far : 5462
aws-bstore-3b 2020-03-30 22:12:11,666 INFO : [PointsReader] - Loading points from file: /data/tmp/loc-all.txt
aws-bstore-3b 2020-03-30 22:12:11,666 INFO : [LoadSamplingConsumer] - Loading the sampling into the database: /data/tmp/sampling-all.txt
aws-bstore-3b 2020-03-30 22:12:11,688 INFO : [Sampling] - Total points sampled : 5462, output file: /data/tmp/sampling-all.txt point file: /data/tmp/loc-all.txt
aws-bstore-3b 2020-03-30 22:12:11,688 INFO : [Sampling] - ********* END - TEST BATCH SAMPLING FROM FILE ***************
aws-bstore-3b 2020-03-30 22:12:12,620 INFO : [LoadSamplingConsumer] - writing to loc:1000: records per sec: 1048.218
aws-bstore-3b 2020-03-30 22:12:13,637 INFO : [LoadSamplingConsumer] - writing to loc:2000: records per sec: 983.2842
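
The `NumberFormatException: empty String` above is raised while converting the point `,` (both values empty) to doubles. The sketch below shows one way to skip and report such lines instead of failing on them. It is written in Python rather than the Scala of biocache, it is not the biocache implementation, and it assumes each line of the points file is a comma-separated coordinate pair, which is inferred only from the logged `Error reading point: ,` message. The sample coordinates are invented.

```python
def parse_points(lines):
    """Parse 'a,b' coordinate lines; return (points, skipped_line_numbers)."""
    points, skipped = [], []
    for lineno, line in enumerate(lines, start=1):
        parts = line.strip().split(",")
        try:
            if len(parts) != 2:
                raise ValueError("expected two comma-separated values")
            # float("") raises ValueError, the Python analogue of the
            # NumberFormatException shown in the log.
            points.append((float(parts[0]), float(parts[1])))
        except ValueError:
            skipped.append(lineno)
    return points, skipped

points, skipped = parse_points(["138.6,-34.9", ",", "140.0,-36.5"])
print(points, skipped)
```

Returning the skipped line numbers makes it possible to trace each rejected point back to the record that produced it.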

After process and indexing: 76,663 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Aco125&fq=last_load_date%3A%5B2020-03-30T00%3A00%3A00Z+TO+*%5D

institutionCode | collectionCode | catalogNumber
SAMA | Herpetology | R65297
SAMA | Herpetology | R31461

patkyn commented 4 years ago

Before load there were 101,644 records in Solr from South Australian Museum Terrestrial Invertebrate Collection: https://collections.ala.org.au/public/show/co56

[DataLoader] - There are 100385 records in the file. The number of NEW records: 1323
[DataLoader] - Load finished for entomology-SAM-DWC-CSV-Export-2002010647.csv

After process and indexing: 99,173 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Aco56&fq=last_load_date%3A%5B2020-03-30T00%3A00%3A00Z+TO+*%5D


institutionCode | collectionCode | catalogNumber
SAMA | Entomology | 32-017412
SAMA | Entomology | 31-000338
SAMA | Entomology | 31-010131

patkyn commented 4 years ago

Before load there were 57,317 records in Solr from South Australian Museum Arachnology Collection: https://collections.ala.org.au/public/show/co202

[DataLoader] - There are 55745 records in the file. The number of NEW records: 263
[DataLoader] - Load finished for arachnology-SAM-DWC-CSV-Export-2001300636.csv

After process and indexing: 55,739 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Aco202&fq=last_load_date%3A%5B2020-03-30T00%3A00%3A00Z+TO+*%5D

institutionCode | collectionCode | catalogNumber
SAMA | Arachnology | NN11965

patkyn commented 4 years ago

Before load there were 120,284 records in Solr from South Australian Museum Australian Biological Tissue Collection: https://collections.ala.org.au/public/show/co166

[DataLoader] - There are 121756 records in the file. The number of NEW records: 2078
[DataLoader] - Load finished for abtc-SAM-DWC-CSV-Export-2001290731.csv

After process and indexing: 121,752 records in solr from this query https://biocache.ala.org.au/occurrences/search?q=collection_uid%3Aco166&fq=last_load_date%3A%5B2020-03-30T00%3A00%3A00Z+TO+*%5D


institutionCode | collectionCode | catalogNumber
SAMA | AustralianBiologicalTissueCollection | ABTC34111
SAMA | AustralianBiologicalTissueCollection | ABTC16385.4
SAMA | AustralianBiologicalTissueCollection | ABTC89567
SAMA | AustralianBiologicalTissueCollection | ABTC72783.1
SAMA | AustralianBiologicalTissueCollection | ABTC26252.2
SAMA | AustralianBiologicalTissueCollection | ABTC108288
SAMA | AustralianBiologicalTissueCollection | ABTC44132
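
For an overview across all eight loads, the shortfall between each upload file and the post-indexing Solr count can be tabulated. The counts below are copied from the comments in this issue; the short collection labels and the layout are just one illustrative way of presenting them.

```python
# (records in upload file, records in Solr after processing and indexing),
# as reported in the comments of this issue.
counts = {
    "ornithology": (60645, 60641),
    "marine-invertebrates": (54707, 54702),
    "mammalogy": (30991, 30990),
    "ichthyology": (16491, 16490),
    "herpetology": (76669, 76663),
    "entomology": (100385, 99173),
    "arachnology": (55745, 55739),
    "abtc": (121756, 121752),
}

# Records present in the file but not found by the post-load query.
shortfall = {name: in_file - in_solr for name, (in_file, in_solr) in counts.items()}

# Print the largest shortfall first.
for name, missing in sorted(shortfall.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22} {missing:>5}")
```

On these figures the entomology load stands out with a shortfall of 1,212 records, while every other collection falls between 1 and 6.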