hakamine closed this issue 5 years ago
Hi @sromkey, I'm not sure how likely it is for AIPs to have more than 1000 fields in the Elasticsearch index but, reading @hakamine's comments, this may be a high-priority issue. I'm also not sure whether we're in time for 1.9.1, but if we are, this should be a "quick" fix.
@jraddaoui did this limit exist prior to 1.9?
Hi @sromkey,
No, it was introduced in Elasticsearch 2.x or 5.x, so it did not affect older AM versions before the ES upgrade.
Workaround:
curl -XPUT 'http://localhost:9200/aips/_settings' -H "Content-Type: application/json" -d '{"index.mapping.total_fields.limit": 10000 }'
Thanks @mamedin! I'd suggest doing the same with all indexes. Using http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings as the URL may do that in a single request (NOT TESTED).
Yes it works, so this is the workaround:
curl -XPUT 'http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings' -H "Content-Type: application/json" -d '{"index.mapping.total_fields.limit": 10000 }'
Thanks @jraddaoui :)
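For scripted upgrades, the same request could be sketched in Python using only the standard library. This is a minimal sketch under assumptions, not Archivematica code: it targets Python 3, the helper names `build_settings_request` and `apply_limits` are hypothetical, and the host and index names are copied from the curl command above.

```python
import json
import urllib.request

# Index names taken from the curl workaround in this thread.
INDEXES = ["aips", "aipfiles", "transfers", "transferfiles"]


def build_settings_request(host, indexes, total_fields=10000):
    """Build (without sending) the PUT request that raises the field limit."""
    url = "{}/{}/_settings".format(host.rstrip("/"), ",".join(indexes))
    body = json.dumps({"index.mapping.total_fields.limit": total_fields})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_limits(host="http://localhost:9200"):
    """Send the request; urlopen raises HTTPError on an error response."""
    request = build_settings_request(host, INDEXES)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the URL and body before touching a live cluster.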
@mamedin, the problem with that workaround is that it won't work with the current Django commands we have where, in some cases, the indexes are fully recreated with the settings from https://github.com/artefactual/archivematica/blob/stable/1.9.x/src/archivematicaCommon/lib/elasticSearchFunctions.py#L339-L359.
The workaround will be needed for upgrades from 1.9.0 to 1.9.1 where the re-indexing process worked, to avoid the issue in newly indexed documents.
I'm working on a fix for it.
yes, it is a workaround for deployed AM.
Hi @hakamine, @sevein, @sromkey and @mamedin,
I just noted the recent changes in the indexing process in qa/1.x from https://github.com/artefactual/archivematica/pull/1365 and that you noted in here that the issue happened to you using that branch. Looking at the changes in that PR, I wonder if this issue was introduced there. We use the namespaces in the index mapping to define fields; see part of the aips and aipfiles index mappings:
https://github.com/artefactual/archivematica/tree/qa/1.x/src/archivematicaCommon/lib/elasticsearch
As you can see, we have field names like ns0:sourceMD_dict_list.ns0:mdWrap_dict_list.ns0:xmlData_dict_list and all those fields are added by default (I noted that we should enhance those names in point 3 of #404). Because the indexes are dynamic, new fields are added to the index whenever the namespace prefix differs, which greatly increases the number of fields in the index.
My biggest concern for qa/1.x is that the new field names may cause other issues: I think we reference them directly with those namespaces, and that will fail for the new ones. So we should take another look at those changes before 1.10.
Nevertheless, since the indexes are dynamic, I think the total fields limit fix is still a good addition for the 1.9.1 release.
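To illustrate how a change of namespace prefix inflates the field count in a dynamic index, here is a small sketch. It is not the Archivematica indexing code: `field_paths` is a hypothetical helper, and the two toy documents are invented for illustration, reusing only the `ns0:` and `mets:` prefixes that appear in this thread.

```python
def field_paths(value, prefix=""):
    """Collect the dotted path of every field a dynamic mapping would create."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = key if not prefix else prefix + "." + key
            paths.add(path)
            paths |= field_paths(child, path)
    elif isinstance(value, list):
        # List items share the mapping of their parent field.
        for item in value:
            paths |= field_paths(item, prefix)
    return paths


# Toy documents (assumption): same structure, different namespace prefix.
doc_ns0 = {
    "ns0:sourceMD_dict_list": [
        {"ns0:mdWrap_dict_list": [{"ns0:xmlData_dict_list": [{}]}]}
    ]
}
doc_mets = {
    "mets:sourceMD_dict_list": [
        {"mets:mdWrap_dict_list": [{"mets:xmlData_dict_list": [{}]}]}
    ]
}

# Indexing both documents creates two disjoint sets of fields.
combined = field_paths(doc_ns0) | field_paths(doc_mets)
```

Each document alone contributes three fields, but because no path is shared between the two prefixes, indexing both doubles the total.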
Fields limit increased in stable/1.9.x and qa/1.x. I'll create a new issue to investigate the possible mapping/namespaces issues in qa/1.x.
@replaceafill found another limit being hit indexing an AIP with multiple directories:
archivematica-mcp-client_1 | RequestError(400, u'illegal_argument_exception', {u'status': 400, u'error': {u'root_cause': [{u'reason': u'Limit of mapping depth [20] in index [aips] has been exceeded due to object field [mets.mets:mets_dict_list.mets:structMap_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list]', u'type': u'illegal_argument_exception'}], u'type': u'illegal_argument_exception', u'reason': u'Limit of mapping depth [20] in index [aips] has been exceeded due to object field 
[mets.mets:mets_dict_list.mets:structMap_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list]'}})
archivematica-mcp-client_1 | Traceback (most recent call last):
archivematica-mcp-client_1 | File "/src/MCPClient/lib/clientScripts/index_aip.py", line 118, in call
archivematica-mcp-client_1 | status_code = index_aip(job)
archivematica-mcp-client_1 | File "/src/MCPClient/lib/clientScripts/index_aip.py", line 94, in index_aip
archivematica-mcp-client_1 | printfn=job.pyprint,
archivematica-mcp-client_1 | File "/src/archivematicaCommon/lib/elasticSearchFunctions.py", line 447, in index_aip_and_files
archivematica-mcp-client_1 | _try_to_index(client, aip_data, "aips", printfn=printfn)
archivematica-mcp-client_1 | File "/src/archivematicaCommon/lib/elasticSearchFunctions.py", line 765, in _try_to_index
archivematica-mcp-client_1 | raise exception
archivematica-mcp-client_1 | RequestError: RequestError(400, u'illegal_argument_exception', u'Limit of mapping depth [20] in index [aips] has been exceeded due to object field [mets.mets:mets_dict_list.mets:structMap_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list]')
The way the structMap element from the METS file is parsed may create very deep documents for AIPs with a deep directory hierarchy. We should at least increase that limit too for 1.9.1, and investigate for 1.10 how to improve the current indexes to better control the metadata included in them.
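The relation between directory nesting and mapping depth can be sketched as follows. This only approximates how Elasticsearch counts depth, and both helpers (`mapping_depth`, `nested_divs`) and the toy document are hypothetical; only the `mets:div_dict_list` field name comes from the error above.

```python
def mapping_depth(value):
    """Approximate the nesting depth of object fields in a document."""
    if isinstance(value, dict):
        return 1 + max((mapping_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        # A list does not add a level; its items do.
        return max((mapping_depth(v) for v in value), default=0)
    return 0


def nested_divs(levels):
    """Build a toy structMap-like document with `levels` nested directories."""
    doc = {"LABEL": "file.txt"}
    for _ in range(levels):
        doc = {"mets:div_dict_list": [doc]}
    return doc


# A hierarchy of 25 directories already exceeds the default limit of 20.
too_deep = mapping_depth(nested_divs(25)) > 20
```

Every directory level adds one object level, so the depth grows linearly with the deepest path in the AIP.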
Hi @jraddaoui, thank you for the fix for the total fields limit. Just to clarify my note describing when this issue started to occur (hopefully it helps to pinpoint the root cause): the AIPs had been indexed without this error on the dev branch dev/issue-24-handle-old-aips (the snapshot here shows the AIPs in the archival storage tab), and that dev branch already contained the ES 6.x upgrade changes. The error happened after the dev branch was rebased for the last time and merged to qa/1.x.
I agree with you that we need to take a look at the indexing changes in qa/1.x before 1.10. I'm not sure whether it is related to this issue, but I think this commit (which replaces ElementTree with lxml.etree in ElasticsearchFunctions) may need to be complemented with fixes in the mapping.
Thanks @hakamine and @replaceafill. I've created https://github.com/archivematica/Issues/issues/619 to follow up with the possible mapping changes before 1.10.
This one should be ready to test in qa/1.x and stable/1.9.x.
@mamedin, to fix instances that have already been upgraded to 1.9.0, you'll need to change the body of the request to:
'{"index.mapping.total_fields.limit": 10000, "index.mapping.depth.limit": 1000}'
Thanks @jraddaoui, I'll take it into account when creating the ansible-archivematica-role PR.
Hi @mamedin, please, could you review this one alongside #595?
The commit https://github.com/artefactual/archivematica/pull/1391/commits/3219a6267500c691e3de956c3820e555f9fae2de fixes the issue for new deploys or when rebuilding indices. It updates the index settings when creating or rebuilding indices, but does not update these settings when the indices already exist, for example when upgrading from AM v1.9.0.
For ansible deploys I created this PR to update the mappings for AM v1.9.0 upgrades:
https://github.com/artefactual-labs/ansible-archivematica-src/pull/242
I think a release note should be added to run the following command when upgrading from AM v1.9.0:
curl -XPUT 'http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings' -H "Content-Type: application/json" -d '{"index.mapping.total_fields.limit": 10000, "index.mapping.depth.limit": 1000 }'
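On an instance upgraded from 1.9.0 it may help to check which existing indexes still lack the new limits before or after running the command. The sketch below is hedged: `indexes_needing_update` is a hypothetical helper, and the response shape (nested settings with string values, as typically returned by a GET on `<indexes>/_settings`) is an assumption illustrated by an invented sample, not output from a real cluster.

```python
# Limits taken from the workaround command in this thread.
REQUIRED = {"total_fields": 10000, "depth": 1000}

# Invented sample (assumption) mimicking a parsed GET _settings response.
SAMPLE_RESPONSE = {
    "aips": {
        "settings": {
            "index": {
                "mapping": {
                    "total_fields": {"limit": "10000"},
                    "depth": {"limit": "1000"},
                }
            }
        }
    },
    "transfers": {"settings": {"index": {"number_of_shards": "5"}}},
}


def indexes_needing_update(settings_response, required=REQUIRED):
    """Return the names of indexes whose limits are missing or too low."""
    outdated = []
    for name, data in sorted(settings_response.items()):
        mapping = data.get("settings", {}).get("index", {}).get("mapping", {})
        for key, minimum in required.items():
            limit = mapping.get(key, {}).get("limit")
            if limit is None or int(limit) < minimum:
                outdated.append(name)
                break
    return outdated
```

With the sample above, only `transfers` would be reported, since it carries no explicit mapping limits.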
@mamedin just reading this over - does your comment above mean that artefactual-labs/ansible-archivematica-src#242 needs to be merged before we can consider this issue to be complete?
I think we can merge #242 later, this PR adds the workaround for 1.9.0:
curl -XPUT 'http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings' -H "Content-Type: application/json" -d '{"index.mapping.total_fields.limit": 10000, "index.mapping.depth.limit": 1000 }'
Note added to release notes.
Expected behaviour
AIP should be indexed without errors
Current behaviour
For some AIPs, after importing and then running the rebuild_elasticsearch_aip_index_from_files management command, getting an error like this:
This error is occurring for several AIPs (5 out of 8) that were indexed before without errors while testing the code for AM issue 24 here
Your environment (version of Archivematica, OS version, etc)
AM qa/1.x (df2c90d) SS qa/0.x (7c3c671)
For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle: