clarin-eric / VLO

Virtual Language Observatory
GNU General Public License v3.0

No title for *some* records within same collection and profile #147

Closed twagoo closed 6 years ago

twagoo commented 6 years ago

Compare "Portuguese newspaper subcorpus from 2013 (por_news_2013_1M)" and "Unnamed record", both from the Leipzig Corpora Collection and based on the same profile (LCC_CorpusProfile).

The latter of these two appears without a name even though it has a value in the LCC_Corpus/Name element just like the former.

Relates to Trac #1045.

teckart commented 6 years ago

Another record with this problem: http://hdl.handle.net/11234/1-1508@format=cmdi (https://vlo.clarin.eu/record?docId=http_58__47__47_hdl.handle.net_47_11234_47_1-1508_64_format_61_cmdi)

twagoo commented 6 years ago

Note that this was reported by @stranak, and automatically turned into a ticket in the CLARIN-D support system. If we solve this or relevant information becomes apparent, we should report back there.

twagoo commented 6 years ago

A subsequent import at vlo.clarin.eu fixed the problematic records, so this mapping mistake/omission seems to have been incidental. I have no suggestions for further investigation, but we should keep an eye on this.

Problematic import was started on 2018-02-20 at 01:19 CET.
Import that fixed the state was started on 2018-02-21 at 13:48 CET.

twagoo commented 6 years ago

However, other records have a missing value for name now! See search results. This leads me to think it could be a concurrency issue.

stranak commented 6 years ago

OK, good to know it is not on our side.

Pavel

On 21 Feb 2018, at 14:27, Twan Goosen notifications@github.com wrote:

However, other records have a missing value for name now! See search results https://vlo.clarin.eu/search?q=-name:*&fqType=collection:or&fq=collection:Leipzig+Corpora+Collection. This leads me to think it could be a concurrency issue.


twagoo commented 6 years ago

Configuring the importer to run only a single processing thread:

<fileProcessingThreads>1</fileProcessingThreads>

makes the issue go away, which strongly suggests a concurrency problem. Next step: check whether this can be reproduced with older versions of the VLO.
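To illustrate why dropping to a single processing thread can hide this kind of symptom, here is a generic sketch in Python. It is not the VLO importer's code, and the thread does not state what the actual defect was. The `process` and `simulate` functions and the shared `scratch` buffer are hypothetical; the "threads" are simulated with generators so the interleaving is deterministic.

```python
# Hypothetical illustration only: NOT the VLO importer's code.
# Two workers share one mutable scratch buffer; each worker runs in three steps.

def process(record_id, name, scratch):
    """Process one record, pausing between steps so a scheduler can interleave."""
    scratch.clear()                 # step 1: start a fresh document
    yield
    scratch["name"] = name          # step 2: map the name field
    yield
    # step 3: emit the document built from the shared buffer
    yield {"id": record_id, "name": scratch.get("name")}


def simulate(schedule):
    """Run workers A and B in the given step order, e.g. 'AAABBB'."""
    scratch = {}                    # shared state (the assumed source of trouble)
    workers = {
        "A": process("A", "Corpus A", scratch),
        "B": process("B", "Corpus B", scratch),
    }
    results = []
    for key in schedule:
        out = next(workers[key])
        if out is not None:
            results.append(out)
    return results


sequential = simulate("AAABBB")    # like a single processing thread
interleaved = simulate("AABABB")   # B resets the buffer before A emits
```

In the sequential schedule both records keep their name. In the interleaved one, record A is emitted without a name even though its source data had one, which matches the symptom of only some records being affected and of different records being affected on each run.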

twagoo commented 6 years ago

An import on alpha-vlo.clarin.eu confirms that d7a43d75311a70e12e4d03175239d22f2579a833 fixes the issue. Will include this in a hotfix release which will be VLO 4.3.6 (beta deployment asap).

twagoo commented 6 years ago

Note: beta currently has ~145k records without a title in its index. Reporting back after first import with vlo-4.3.6-beta1.

twagoo commented 6 years ago

As of this morning the number of results for -name:* is 62452 on beta. This confirms the fix.
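For anyone wanting to repeat this check, the `-name:*` query can be put into a search URL of the same shape as the one posted earlier in this thread. A minimal sketch in Python; the `missing_name_url` helper is hypothetical, and the default base URL is the production one from the earlier comment (the beta host is not given here, so pass it in yourself).

```python
from urllib.parse import urlencode


def missing_name_url(collection=None, base="https://vlo.clarin.eu/search"):
    """Build a VLO search URL listing records that have no name value."""
    params = [("q", "-name:*")]     # negated wildcard: records without a name
    if collection is not None:
        # Same collection filter parameters as in the URL quoted in the thread.
        params.append(("fqType", "collection:or"))
        params.append(("fq", f"collection:{collection}"))
    # Keep ':' and '*' readable; spaces become '+'.
    return base + "?" + urlencode(params, safe=":*")


url = missing_name_url("Leipzig Corpora Collection")
```

Called without a collection, it returns the unfiltered query for all records without a name.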