archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: AIP index error: ''Limit of total fields [1000] in index [aips] has been exceeded'' #608

Closed hakamine closed 5 years ago

hakamine commented 5 years ago

Expected behaviour: AIP should be indexed without errors.

Current behaviour: For some AIPs, after importing them and then running the rebuild_elasticsearch_aip_index_from_files management command, an error like this is raised:

/usr/share/archivematica/virtualenvs/archivematica-dashboard/bin/python manage.py rebuild_elasticsearch_aip_index_from_files /mnt/aipstore/ --uuid a3af2073-2c1f-4ec2-a24d-ac8a9de17851
Rebuilding AIP UUID a3af2073-2c1f-4ec2-a24d-ac8a9de17851
Processing AIP a3af2073-2c1f-4ec2-a24d-ac8a9de17851
Command to run: ['unar', '-force-overwrite', '-o', '/tmp/tmpUDKs3O', '/mnt/aipstore/a3af/2073/2c1f/4ec2/a24d/ac8a/9de1/785
1/1970-03-03-a3af2073-2c1f-4ec2-a24d-ac8a9de17851.7z', '1970-03-03-a3af2073-2c1f-4ec2-a24d-ac8a9de17851/data/METS.a3af2073
-2c1f-4ec2-a24d-ac8a9de17851.xml']
/mnt/aipstore/a3af/2073/2c1f/4ec2/a24d/ac8a/9de1/7851/1970-03-03-a3af2073-2c1f-4ec2-a24d-ac8a9de17851.7z: 7-Zip
  1970-03-03-a3af2073-2c1f-4ec2-a24d-ac8a9de17851/data/METS.a3af2073-2c1f-4ec2-a24d-ac8a9de17851.xml  (3005137 B)... OK.
Successfully extracted to "/tmp/tmpUDKs3O/1970-03-03-a3af2073-2c1f-4ec2-a24d-ac8a9de17851".
AIP UUID: a3af2073-2c1f-4ec2-a24d-ac8a9de17851
Indexing AIP ...
Removed FITS output from METS.
ERROR: error trying to index.
RequestError(400, u'illegal_argument_exception', u'Limit of total fields [1000] in index [aips] has been exceeded')
ERROR: error trying to index.
RequestError(400, u'illegal_argument_exception', u'Limit of total fields [1000] in index [aips] has been exceeded'
...
^CTraceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/managem
ent/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/managem
ent/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/managem
ent/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/managem
ent/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/opt/archivematica/archivematica/src/dashboard/src/main/management/commands/rebuild_elasticsearch_aip_index_from_f
iles.py", line 298, in handle
    delete_existing_data=options["delete"],
  File "/opt/archivematica/archivematica/src/dashboard/src/main/management/commands/rebuild_elasticsearch_aip_index_from_f
iles.py", line 171, in processAIPThenDeleteMETSFile
    identifiers=[],  # TODO get these
  File "/usr/lib/archivematica/archivematicaCommon/elasticSearchFunctions.py", line 442, in index_aip_and_files
    _try_to_index(client, aip_data, "aips", printfn=printfn)
  File "/usr/lib/archivematica/archivematicaCommon/elasticSearchFunctions.py", line 755, in _try_to_index
    time.sleep(wait_between_tries)
KeyboardInterrupt

This error occurs for several AIPs (5 out of 8) that had previously been indexed without errors while testing the code for AM issue 24.

Your environment (version of Archivematica, OS version, etc): AM qa/1.x (df2c90d), SS qa/0.x (7c3c671)

jraddaoui commented 5 years ago

Hi @sromkey, I'm not sure how likely it is for AIPs to produce more than 1000 fields in the Elasticsearch index but, going by @hakamine's comments, this may be a high-priority issue. I'm also not sure whether we're still in time for 1.9.1, but if we are, this should be a "quick" fix.

sromkey commented 5 years ago

@jraddaoui did this limit exist prior to 1.9?

jraddaoui commented 5 years ago

Hi @sromkey,

No, it was introduced in Elasticsearch 2.x or 5.x, so it did not affect AM versions released before the ES upgrade.

mamedin commented 5 years ago

Workaround:

curl -XPUT 'http://localhost:9200/aips/_settings' -H "Content-Type: application/json"  -d '{"index.mapping.total_fields.limit": 10000 }'

jraddaoui commented 5 years ago

Thanks @mamedin! I'd suggest doing the same for all indexes. Using http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings as the URL may do that in a single request (NOT TESTED).

mamedin commented 5 years ago

Yes it works, so this is the workaround:

curl -XPUT 'http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings' -H "Content-Type: application/json"  -d '{"index.mapping.total_fields.limit": 10000 }'

Thanks @jraddaoui :)

jraddaoui commented 5 years ago

@mamedin, the problem with that workaround is that it won't survive the current Django commands, which in some cases fully recreate the indexes with the settings from https://github.com/artefactual/archivematica/blob/stable/1.9.x/src/archivematicaCommon/lib/elasticSearchFunctions.py#L339-L359.

The workaround will still be needed for upgrades from 1.9.0 to 1.9.1 where the re-indexing process worked, to avoid the issue in newly indexed documents.

I'm working on a fix for it.

mamedin commented 5 years ago

Yes, it is a workaround for already deployed AM instances.

jraddaoui commented 5 years ago

Hi @hakamine, @sevein, @sromkey and @mamedin,

I just noticed the recent changes to the indexing process in qa/1.x from https://github.com/artefactual/archivematica/pull/1365, and that you mentioned here that the issue happened to you on that branch. Looking at the changes in that PR, I wonder if this issue was introduced there. We use the namespaces in the index mapping to define fields; see part of the aips and aipfiles index mappings:

https://github.com/artefactual/archivematica/tree/qa/1.x/src/archivematicaCommon/lib/elasticsearch

As you can see, we have field names like ns0:sourceMD_dict_list.ns0:mdWrap_dict_list.ns0:xmlData_dict_list and all those fields are added by default (I noted in point 3 of #404 that we should improve those names). Because the indexes are dynamic, new fields are added to the index whenever the namespace prefix differs, which greatly increases the number of fields in the index.
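To illustrate the growth described above, here is a minimal Python sketch. It is not Archivematica code: `field_paths` and `sample_doc` are hypothetical helpers, the `mets` prefix is only assumed as an example of a differing prefix, and the sketch counts leaf paths only, whereas Elasticsearch also counts object mappings towards the total fields limit.

```python
def field_paths(doc, prefix=""):
    """Return the set of dotted leaf paths that a nested dict would produce."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths |= field_paths(value, path)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    paths |= field_paths(item, path)
                else:
                    paths.add(path)
        else:
            paths.add(path)
    return paths


def sample_doc(ns):
    """Hypothetical document using the naming pattern quoted above."""
    return {
        f"{ns}:sourceMD_dict_list": [
            {f"{ns}:mdWrap_dict_list": [
                {f"{ns}:xmlData_dict_list": [{"value": "x"}]}
            ]}
        ]
    }


old_fields = field_paths(sample_doc("ns0"))
new_fields = field_paths(sample_doc("mets"))  # assumed alternative prefix
combined = old_fields | new_fields

# The same structure under a different prefix shares no field names
# with the original, so a dynamic index has to hold both sets.
print(len(old_fields), len(new_fields), len(combined))
```

With identical content, the combined set is as large as both sets together, which is how a dynamic index can approach the limit without any new metadata being added.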

My biggest concern for qa/1.x is that the new field names may cause other issues: I think we reference fields directly by their namespaced names, and those references will fail for the new ones. So we should take another look at those changes before 1.10.

Nevertheless, since the indexes are dynamic, I think the total fields limit fix is still a good addition for the 1.9.1 release.

jraddaoui commented 5 years ago

Fields limit increased in stable/1.9.x and qa/1.x. I'll create a new issue to investigate the possible mapping/namespaces issues in qa/1.x.

jraddaoui commented 5 years ago

@replaceafill found another limit being hit indexing an AIP with multiple directories:

archivematica-mcp-client_1       | RequestError(400, u'illegal_argument_exception', {u'status': 400, u'error': {u'root_cause': [{u'reason': u'Limit of mapping depth [20] in index [aips] has been exceeded due to object field [mets.mets:mets_dict_list.mets:structMap_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list]', u'type': u'illegal_argument_exception'}], u'type': u'illegal_argument_exception', u'reason': u'Limit of mapping depth [20] in index [aips] has been exceeded due to object field 
[mets.mets:mets_dict_list.mets:structMap_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list]'}})
archivematica-mcp-client_1       | Traceback (most recent call last):
archivematica-mcp-client_1       |   File "/src/MCPClient/lib/clientScripts/index_aip.py", line 118, in call
archivematica-mcp-client_1       |     status_code = index_aip(job)
archivematica-mcp-client_1       |   File "/src/MCPClient/lib/clientScripts/index_aip.py", line 94, in index_aip
archivematica-mcp-client_1       |     printfn=job.pyprint,
archivematica-mcp-client_1       |   File "/src/archivematicaCommon/lib/elasticSearchFunctions.py", line 447, in index_aip_and_files
archivematica-mcp-client_1       |     _try_to_index(client, aip_data, "aips", printfn=printfn)
archivematica-mcp-client_1       |   File "/src/archivematicaCommon/lib/elasticSearchFunctions.py", line 765, in _try_to_index
archivematica-mcp-client_1       |     raise exception
archivematica-mcp-client_1       | RequestError: RequestError(400, u'illegal_argument_exception', u'Limit of mapping depth [20] in index [aips] has been exceeded due to object field [mets.mets:mets_dict_list.mets:structMap_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list.mets:div_dict_list]')

The way the structMap element from the METS file is parsed may create very deeply nested documents for AIPs with a deep directory hierarchy. We should at least increase that limit too for 1.9.1, and for 1.10 investigate how to improve the current indexes to better control the metadata included in them.
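As a rough illustration of how directory depth turns into mapping depth, here is a small Python sketch. `nest_divs` and `object_depth` are hypothetical helpers, not part of Archivematica, and the way depth is counted here is a simplification that may differ by one level from Elasticsearch's own accounting.

```python
def nest_divs(levels):
    """Build a document with `levels` nested mets:div_dict_list entries."""
    doc = {"label": "leaf"}
    for _ in range(levels):
        doc = {"mets:div_dict_list": [doc]}
    return doc


def object_depth(value):
    """Count nested object levels; lists do not add a level themselves."""
    if isinstance(value, dict):
        return 1 + max((object_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((object_depth(v) for v in value), default=0)
    return 0


DEPTH_LIMIT = 20  # the limit reported in the error above

depth = object_depth(nest_divs(25))
print(depth, depth > DEPTH_LIMIT)
```

Each nested directory adds one more mets:div_dict_list level, so a sufficiently deep hierarchy crosses the limit regardless of how many files it contains.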

hakamine commented 5 years ago

Hi @jraddaoui, thank you for the fix for the total fields limit. Just to clarify my note about when this issue started to occur (hopefully it helps to pinpoint the root cause): the AIPs were indexed without this error on the dev branch dev/issue-24-handle-old-aips (the snapshot here shows the AIPs in the archival storage tab), and that dev branch already contained the ES 6.x upgrade changes. The error appeared after the dev branch was rebased for the last time and merged to qa/1.x.

I agree with you that we need to take a look at the indexing changes in qa/1.x before 1.10. I'm not sure whether it is related to this issue, but I think this commit (which replaces ElementTree with lxml.etree in elasticSearchFunctions) may need to be complemented with fixes in the mapping.

jraddaoui commented 5 years ago

Thanks @hakamine and @replaceafill. I've created https://github.com/archivematica/Issues/issues/619 to follow up with the possible mapping changes before 1.10.

This one should be ready to test in qa/1.x and stable/1.9.x.

jraddaoui commented 5 years ago

@mamedin, to fix instances that have already been upgraded to 1.9.0, you'll need to change the body of the request to:

'{"index.mapping.total_fields.limit": 10000, "index.mapping.depth.limit": 1000}'
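For anyone who prefers to script this instead of using curl, the same request could be built in Python roughly as follows. This is only a sketch: `build_settings_request` is a hypothetical helper, the base URL is assumed to be the local Elasticsearch used in the commands above, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request

INDEXES = ["aips", "aipfiles", "transfers", "transferfiles"]


def build_settings_request(base_url="http://localhost:9200",
                           total_fields=10000, depth=1000):
    """Build (but do not send) the PUT request that raises both limits."""
    body = {
        "index.mapping.total_fields.limit": total_fields,
        "index.mapping.depth.limit": depth,
    }
    return Request(
        url=f"{base_url}/{','.join(INDEXES)}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_settings_request()
print(req.get_method(), req.full_url)
# To apply it against a running Elasticsearch:
#     from urllib.request import urlopen
#     urlopen(req)
```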

mamedin commented 5 years ago

Thanks @jraddaoui, I'll take it into account when creating the ansible-archivematica-role PR.

jraddaoui commented 5 years ago

Hi @mamedin, please, could you review this one alongside #595?

mamedin commented 5 years ago

The commit https://github.com/artefactual/archivematica/pull/1391/commits/3219a6267500c691e3de956c3820e555f9fae2de fixes the issue for new deploys or when rebuilding indices. It applies the new settings when indices are created or rebuilt, but does not update them on indices that already exist, for example when upgrading from AM v1.9.0.
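To check whether an existing index still needs the update, a comparison along these lines could be used. This is a sketch under assumptions: `missing_settings` is a hypothetical helper, and `current` is assumed to be a flat settings dictionary for one index, such as what Elasticsearch returns when settings are requested with flat_settings=true. Values are compared as strings because Elasticsearch reports setting values as strings.

```python
DESIRED = {
    "index.mapping.total_fields.limit": 10000,
    "index.mapping.depth.limit": 1000,
}


def missing_settings(current, desired=DESIRED):
    """Return the desired settings that are absent or different in `current`."""
    return {
        key: value
        for key, value in desired.items()
        if str(current.get(key)) != str(value)
    }


# An index created under 1.9.0 would have neither setting.
print(missing_settings({}))
# An index where only the first workaround was applied.
print(missing_settings({"index.mapping.total_fields.limit": "10000"}))
```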

For ansible deploys I created this PR to update the mappings for AM v1.9.0 upgrades:

https://github.com/artefactual-labs/ansible-archivematica-src/pull/242

I think a release note should be added saying to run the following command when upgrading from AM v1.9.0:

curl -XPUT 'http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings' -H "Content-Type: application/json"  -d '{"index.mapping.total_fields.limit": 10000, "index.mapping.depth.limit": 1000 }'

sallain commented 5 years ago

@mamedin just reading this over - does your comment above mean that artefactual-labs/ansible-archivematica-src#242 needs to be merged before we can consider this issue to be complete?

mamedin commented 5 years ago

I think we can merge #242 later; that PR adds the workaround for 1.9.0:

curl -XPUT 'http://localhost:9200/aips,aipfiles,transfers,transferfiles/_settings' -H "Content-Type: application/json" -d '{"index.mapping.total_fields.limit": 10000, "index.mapping.depth.limit": 1000 }'

sallain commented 5 years ago

Note added to release notes.