amundsen-io / amundsen

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
https://www.amundsen.io/amundsen/
Apache License 2.0

AWS: Amundsen Frontend not reaching Neo4j #257

Closed lordravo closed 3 years ago

lordravo commented 4 years ago

I just followed the aws-ecs-deployment guide, and successfully launched the frontend, elasticsearch and neo4j endpoints.

After executing a DAG with a BigQueryMetadataExtractor, I was also able to send metadata to Neo4j. As you can see: [image]

But for some reason, nothing is reachable through the Amundsen frontend. Just a blank page: [image]

Looking into the developer console, there is a failed request to /api/search/v0/table?query=events&page_index=0: [image]

{"msg":"Encountered exception: HTTPConnectionPool(host='amundsensearch', port=5000): Max retries exceeded with url: /search?query_term=events&page_index=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3bf88fd470>: Failed to establish a new connection: [Errno -2] Name or service not known'))","search_term":"events","tables":{"page_index":0,"results":[],"total_results":0}}

I am not sure if it is related, but the elasticsearch endpoint returns the following:

    {
      "name" : "-ZjL23y",
      "cluster_name" : "docker-cluster",
      "cluster_uuid" : "yekS4uxBS76ULwUj4yp57A",
      "version" : {
        "number" : "6.7.0",
        "build_flavor" : "default",
        "build_type" : "docker",
        "build_hash" : "8453f77",
        "build_date" : "2019-03-21T15:32:29.844721Z",
        "build_snapshot" : false,
        "lucene_version" : "7.7.0",
        "minimum_wire_compatibility_version" : "5.6.0",
        "minimum_index_compatibility_version" : "5.0.0"
      },
      "tagline" : "You Know, for Search"
    }

Any ideas why? Am I missing something?

jornh commented 4 years ago

It could be the “classic gotcha” that you also need to ingest data into Elasticsearch through the databuilder.

Here’s a sample of doing that: https://github.com/lyft/amundsendatabuilder/blob/v1.5.1/example/scripts/sample_data_loader.py#L590

jornh commented 4 years ago

The architecture diagram and the Search and Databuilder sections in https://github.com/lyft/amundsen/blob/master/docs/architecture.md give a good high-level overview.

lordravo commented 4 years ago

Hi @jornh, thanks for the help..

I added a create_es_publisher_sample_job to the DAG, but even so, all requests on the Amundsen frontend end up with status 500. Could it be a configuration issue?

Even the /api/auth_user endpoint returns:

{"msg":"Encountered exception: AUTH_USER_METHOD is not configured"}

And api/metadata/v0/get_last_indexed:

{"msg":"Encountered exception: HTTPConnectionPool(host='amundsenmetadata', port=5000): Max retries exceeded with url: /latest_updated_ts (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3bf88ac080>: Failed to establish a new connection: [Errno -2] Name or service not known'))","timestamp":null}
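
Both connection errors end in `[Errno -2] Name or service not known`, which points to a failed hostname lookup rather than a refused connection. A minimal sketch for confirming that from a Python shell inside the frontend container, using only the standard library; the hostnames and port come from the error messages above, and the helper name `check_hosts` is hypothetical:

```python
import socket


def check_hosts(hosts, port=5000):
    """Return {host: first resolved address, or an 'unresolved' message}."""
    results = {}
    for host in hosts:
        try:
            # getaddrinfo performs the same kind of name lookup the HTTP client relies on
            info = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            results[host] = info[0][4][0]
        except socket.gaierror as exc:
            results[host] = f"unresolved: {exc}"
    return results
```

Calling `check_hosts(["amundsensearch", "amundsenmetadata"])` should show an address for each name if the containers can see each other, and an `unresolved: ...` entry otherwise.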

I am attaching my code anyway: sample_amundsen_v2.zip

The amundsen_databuilder_table_metadata_job task works properly, adding the metadata to Neo4j.

The es_table_job task ends up with success, but I'm not sure how to check its effects other than by trying to query on the Amundsen frontend.

    es_table_job = PythonOperator(
        task_id='es_table_job',
        python_callable=create_es_publisher_sample_job,
        provide_context=True,
        op_kwargs={
            'elasticsearch_index_alias': 'table_search_index',
            'elasticsearch_doc_type_key': 'table',
            'model_name': 'databuilder.models.table_elasticsearch_document.TableESDocument'
        }
    )

Any idea?


Edit: It seems on the Elasticsearch side, everything is fine. Looking into http://host:9200/tablese8214591-bd7b-4c2f-b6d3-2eabcc8a6aa2/_search, I was able to retrieve all metadata previously sent.
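
The same check can be scripted against the alias rather than the generated index name. This is a rough sketch, assuming the `table_search_index` alias passed to the job above is what gets queried and that Elasticsearch is reachable at the `http://host:9200` placeholder; the function names are hypothetical, and `_count` is the standard Elasticsearch count endpoint:

```python
import json
import urllib.request


def build_count_url(base_url, index_or_alias):
    """Build the URL of the Elasticsearch _count endpoint for an index or alias."""
    return f"{base_url.rstrip('/')}/{index_or_alias}/_count"


def count_documents(base_url, index_or_alias, timeout=5):
    """Return the number of documents visible through the given index or alias."""
    url = build_count_url(base_url, index_or_alias)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["count"]
```

For example, `count_documents("http://host:9200", "table_search_index")` returning a non-zero number would suggest the publisher job populated the index and the alias points at it.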

Digging into docker-ecs-amundsen.yml, there are the following environment variables on amundsenfrontend:

I wonder if it is related, since every search comes up with:

Encountered exception: HTTPConnectionPool(host='amundsensearch', port=5000): Max retries exceeded with url: /search?query_term=test&page_index=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3bf8909278>: Failed to establish a new connection: [Errno -2] Name or service not known'))


Edit2: Looking back at my launch script, I found some odd warnings from ecs-cli compose:

    ecs-cli compose --cluster-config lg-amundsen --file docker-ecs-amundsen.yml up --create-log-groups --ecs-profile lg-amundsen
    [WARN] Skipping unsupported YAML option for service... [option name]=container_name [service name]=amundsenfrontend
    [WARN] Skipping unsupported YAML option for service... [option name]=depends_on [service name]=amundsenfrontend
    [WARN] Skipping unsupported YAML option for service... [option name]=container_name [service name]=neo4j
    [WARN] Skipping unsupported YAML option for service... [option name]=container_name [service name]=elasticsearch
    [WARN] Skipping unsupported YAML option for service... [option name]=container_name [service name]=amundsensearch
    [WARN] Skipping unsupported YAML option for service... [option name]=depends_on [service name]=amundsensearch
    [WARN] Skipping unsupported YAML option for service... [option name]=container_name [service name]=amundsenmetadata
    [WARN] Skipping unsupported YAML option for service... [option name]=depends_on [service name]=amundsenmetadata

It seems to have skipped YAML keys like container_name and depends_on. That seems bad.
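
To see at a glance which services lost which options, the warnings can be grouped by service. A small sketch, assuming the log lines keep the `[option name]=... [service name]=...` shape shown above; the function name `skipped_options` is hypothetical:

```python
import re
from collections import defaultdict

# Matches the "[option name]=X [service name]=Y" tail of each ecs-cli warning
WARNING_PATTERN = re.compile(r"\[option name\]=(\S+)\s+\[service name\]=(\S+)")


def skipped_options(log_text):
    """Map each service name to the sorted list of YAML options ecs-cli skipped."""
    by_service = defaultdict(set)
    for option, service in WARNING_PATTERN.findall(log_text):
        by_service[service].add(option)
    return {service: sorted(options) for service, options in by_service.items()}
```

Every service in the output above loses `container_name`, which would be consistent with the frontend failing to resolve `amundsensearch` and `amundsenmetadata` by name, though the thread does not confirm that this is the cause.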

lordravo commented 4 years ago

I'm pretty sure it is ECS-related. I just launched Amundsen on a CentOS instance on Google Cloud, with the exact same yml (without the AWS logs), and everything is working so far.

andrey-oreshko commented 4 years ago

@lordravo hey, I'm experiencing the same issue. Did you come up with a solution other than using CentOS?

hkuchibhotla commented 3 years ago

Running into the same issue. Any luck with getting this to work on ECS?