Closed fmatthies closed 3 months ago
@ChristophB: I don't know how extensive a change this would be, but right now in EntityForm the expression-input for a CompositeConcept excludes Constant entities. However, the Distance function requires a constant (or rather an integer) as its second parameter.
We could remove this exclusion from the expression-input component. There is currently no way to check if an argument is applicable to a function, so constants will be available everywhere in the concept expression editor.
Some more adjustments are needed here. Constants are not limited to integers.
Edit: Done in 1583fda4
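For reference, a minimal sketch of what an applicability check could look like if one is added later. All names (`EntityKind`, `FunctionSignature`, `isApplicable`) are hypothetical and not taken from the frontend code; only the rule that Distance takes a constant as its second parameter comes from this thread.

```typescript
// Hypothetical types: none of these names exist in the actual code base.
type EntityKind = "concept" | "constant";

interface FunctionSignature {
  name: string;
  // Allowed kinds per zero-based argument position.
  // Positions that are not listed accept any kind.
  allowed: { [position: number]: EntityKind[] };
}

// Assumption from the thread: Distance needs a constant as its second argument.
const distance: FunctionSignature = {
  name: "Distance",
  allowed: { 1: ["constant"] },
};

// Returns true if an entity of the given kind may be used at the given position.
function isApplicable(
  sig: FunctionSignature,
  position: number,
  kind: EntityKind,
): boolean {
  const allowed = sig.allowed[position];
  return allowed === undefined || allowed.indexOf(kind) !== -1;
}
```

With something like this, the expression-input could offer constants only where a function accepts them instead of everywhere.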
@fmatthies / @ChristophB: need to look into status text, as it doesn't seem to update within the Dialog. If it's closed and opened again, all is well.
This is still hard for me to reproduce without a proper document server and graph API. The status is derived from the computed value graphPipelineStatus. My guess is that pipeline responses might be missing the status value.
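If missing status values turn out to be the cause, a defensive fallback could look roughly like this. It is a sketch under assumptions: the `PipelineResponse` shape, the `resolveStatus` helper, and the `RUNNING` and `UNKNOWN` values are made up for illustration; only `SUCCESSFUL` and `FAILED` appear in this thread.

```typescript
// Hypothetical status values; only SUCCESSFUL and FAILED are known from the thread.
type PipelineStatus = "SUCCESSFUL" | "FAILED" | "RUNNING" | "UNKNOWN";

// Hypothetical response shape with an optional status field.
interface PipelineResponse {
  name: string;
  status?: string | null;
}

const KNOWN_STATUSES: PipelineStatus[] = ["SUCCESSFUL", "FAILED", "RUNNING"];

// Maps a possibly incomplete response to a status the UI can always render.
function resolveStatus(response?: PipelineResponse | null): PipelineStatus {
  if (!response || !response.status) return "UNKNOWN";
  const upper = response.status.toUpperCase();
  for (const known of KNOWN_STATUSES) {
    if (known === upper) return known;
  }
  return "UNKNOWN";
}
```

A computed value built on such a helper would at least show an explicit "unknown" state in the dialog instead of a stale one.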
@fmatthies / @ChristophB: I added a Build query button to Documents page. (see picture 1) I thought that it would be nice to jump into the Queries page with the Data Source selected from where one jumped. However, before I make it functional, I wanted to see/check if this is something that goes along with the design choice because a Repository needs to be selected as well
Currently, the button just redirects to the query builder. I think this is fine for now. And there are also two things to consider here:
@fmatthies / @ChristophB: when coming from Query results (Show data set) the left menu is not updated (see picture 2) and clicking on Documents resets the url but still shows the query results (one needs to hit F5 or change to another TOP page and back)
I think the cause is that there is no data source selected if the page is accessed via "show data set". The desired data source can be propagated via the property dataSource of the DocumentSearchForm component or retrieved from the query object.
Okay, thanks for the heads up!
@fmatthies / @ChristophB: I added a Build query button to Documents page. (see picture 1) I thought that it would be nice to jump into the Queries page with the Data Source selected from where one jumped. However, before I make it functional, I wanted to see/check if this is something that goes along with the design choice because a Repository needs to be selected as well
Currently, the button just redirects to the query builder. I think this is fine for now. And there are also two things to consider here:
1. The document search page is adapter-based and completely unrelated to repositories. So we have no way for the user to select a repository before clicking the button.
2. Repositories might not be configured for all data adapters. It is unclear whether a given adapter-repository combination is allowed to perform queries.
When clicking the button, the dataSource will be sent to QueryBuilder as well. Therein, setRepository checks whether the data source is configured for this particular repo-organisation. If not, a notify warning is shown to inform the user. Otherwise, the dataSource is pre-selected. (a441570)
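Roughly, the check works like the sketch below. This is not the code from a441570: the `Repository` and `DataSource` shapes and the `preselectDataSource` helper are simplified, hypothetical stand-ins, and the warning text is invented.

```typescript
// Hypothetical, simplified shapes; the real types may differ.
interface DataSource {
  id: string;
}

interface Repository {
  id: string;
  dataSources: DataSource[]; // data sources configured for this repository
}

interface PreselectResult {
  selected?: DataSource; // set if the requested data source is configured
  warning?: string; // set if the user should be notified
}

// Decides whether the data source passed from the Documents page can be
// pre-selected for the chosen repository.
function preselectDataSource(
  repository: Repository,
  requested?: DataSource,
): PreselectResult {
  if (requested === undefined) return {};
  const matches = repository.dataSources.filter((ds) => ds.id === requested.id);
  if (matches.length === 0) {
    return {
      warning: `Data source '${requested.id}' is not configured for repository '${repository.id}'.`,
    };
  }
  return { selected: matches[0] };
}
```

Returning the warning as data keeps the decision separate from the notify call, which makes it easy to test without the UI.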
@fmatthies / @ChristophB: when coming from Query results (Show data set) the left menu is not updated (see picture 2) and clicking on Documents resets the url but still shows the query results (one needs to hit F5 or change to another TOP page and back)
I think the cause is that there is no data source selected if the page is accessed via "show data set". The desired data source can be propagated via the property dataSource of the DocumentSearchForm component or retrieved from the query object.
Hm, I just checked. I already implemented that. The second picture shows that the dataSource is successfully read out, as well.
@fmatthies / @ChristophB: need to look into status text, as it doesn't seem to update within the Dialog. If it's closed and opened again, all is well.
This is still hard for me to reproduce without a proper document server and graph API. The status is derived from the computed value graphPipelineStatus. My guess is that pipeline responses might be missing the status value.
Not just necessarily for this issue, but in general for the backend (or better yet the adapters):
The two datasource configs I use are attached (added a txt extension because GitHub did not allow yml files to be uploaded):
Test_Data_Source_1.yml.txt
Test_Data_Source_2.yml.txt
Task 4 might be fixed now.
Still can't reproduce pipelines locally. I don't have documents.
Elasticsearch runs under 0.0.0.0:9008 and the index is documents, if you need test documents.
This doesn't really help. When I try to run it with a local instance, the following error is raised:
{"error":"Couldn't find graph pickle 'test_data_source_3_graph.pickle'. Probably steps before failed; check the logs.","name":"test_data_source_3"} -- 500 Internal Server Error from GET http://localhost:9007/graph/statistics?process=Test_Data_Source_3
But initially, the response to startConceptGraphPipeline(...) has SUCCESSFUL as status. There seems to be something off with either backend or concept-graphs. Another request for the pipeline status is necessary to get the correct FAILED status.
Okay, thanks. I'll look into it. Could you provide me with your setup? It could be that I didn't update the concept-graphs-api on top-prod. I always used my local instance of it to be able to debug.
I use a local instance as well, with branch "issues_4_5_improvements".
clicking on Documents resets the url but still shows the query results (one needs to hit F5 or change to another TOP page and back)
This is kind of intentional, because clicking on "Documents" has the same effect as clearing the query result. In both cases, the data source remains selected. Should we leave it this way?
This doesn't really help. When I try to run it with a local instance, the following error is raised:
{"error":"Couldn't find graph pickle 'test_data_source_3_graph.pickle'. Probably steps before failed; check the logs.","name":"test_data_source_3"} -- 500 Internal Server Error from GET http://localhost:9007/graph/statistics?process=Test_Data_Source_3
But initially, the response to startConceptGraphPipeline(...) has SUCCESSFUL as status. There seems to be something off with either backend or concept-graphs. Another request for the pipeline status is necessary to get the correct FAILED status.
Can you see the logs of the concept-graphs-api? Normally, this error indicates that some step in the graph creation process failed (one of process documents, embed phrases, clustering, or graph creation).
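To narrow down which step failed, one could check which intermediate pickle files exist. The sketch below assumes that each step writes `<process>_<suffix>.pickle`, with the suffixes `data`, `embedding`, `clustering`, and `graph` as they appear in the error message and the container output in this thread. The mapping from step to suffix and the `firstMissingStep` helper are assumptions.

```typescript
// Step names are quoted from the comment above.
// The step-to-suffix mapping is an assumption based on the file names in the logs.
const STEPS: Array<{ step: string; suffix: string }> = [
  { step: "process documents", suffix: "data" },
  { step: "embed phrases", suffix: "embedding" },
  { step: "clustering", suffix: "clustering" },
  { step: "graph creation", suffix: "graph" },
];

// Returns the first step whose pickle file is missing, or null if all exist.
function firstMissingStep(
  processName: string,
  existingFiles: string[],
): string | null {
  for (const { step, suffix } of STEPS) {
    const expected = `${processName}_${suffix}.pickle`;
    if (existingFiles.indexOf(expected) === -1) return step;
  }
  return null;
}
```

For example, passing the three files that were saved for test_data_source_3 (data, embedding, clustering) would point at graph creation as the step to look at.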
This is the output of concept graphs:
concept-graphs-api-1 | INFO:main:Using process name 'test_data_source_3'
concept-graphs-api-1 | [2024-07-17 11:42:24,109] INFO in main_methods: Using process name 'test_data_source_3'
concept-graphs-api-1 | INFO:main:Using preset language settings for 'de'
concept-graphs-api-1 | [2024-07-17 11:42:24,110] INFO in main_methods: Using preset language settings for 'de'
concept-graphs-api-1 | INFO:main:Skipping present saved steps
concept-graphs-api-1 | [2024-07-17 11:42:24,110] INFO in main_methods: Skipping present saved steps
concept-graphs-api-1 | INFO:main:Reading config (data) ...
concept-graphs-api-1 | [2024-07-17 11:42:24,136] INFO in main_methods: Reading config (data) ...
concept-graphs-api-1 | INFO:main:No config file provided; using default values
concept-graphs-api-1 | [2024-07-17 11:42:24,136] INFO in preprocessing_util: No config file provided; using default values
concept-graphs-api-1 | INFO:main:Parsed the following arguments for <preprocessing_util.PreprocessingUtil object at 0x7f2378f83850>:
concept-graphs-api-1 | {'spacy_model': 'de_dep_news_trf', 'file_encoding': 'utf-8', 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | [2024-07-17 11:42:24,136] INFO in main_methods: Parsed the following arguments for <preprocessing_util.PreprocessingUtil object at 0x7f2378f83850>:
concept-graphs-api-1 | {'spacy_model': 'de_dep_news_trf', 'file_encoding': 'utf-8', 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | INFO:main:Labels will be extracted from the document server if the field 'label' is present.
concept-graphs-api-1 | [2024-07-17 11:42:24,137] INFO in preprocessing_util: Labels will be extracted from the document server if the field 'label' is present.
concept-graphs-api-1 | INFO:main:Reading config (embedding) ...
concept-graphs-api-1 | [2024-07-17 11:42:24,137] INFO in main_methods: Reading config (embedding) ...
concept-graphs-api-1 | INFO:main:No config file provided; using default values
concept-graphs-api-1 | [2024-07-17 11:42:24,137] INFO in embedding_util: No config file provided; using default values
concept-graphs-api-1 | INFO:main:Parsed the following arguments for <embedding_util.PhraseEmbeddingUtil object at 0x7f23790c3430>:
concept-graphs-api-1 | {'model': 'Sahajtomar/German-semantic', 'down_scale_algorithm': None, 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | [2024-07-17 11:42:24,137] INFO in main_methods: Parsed the following arguments for <embedding_util.PhraseEmbeddingUtil object at 0x7f23790c3430>:
concept-graphs-api-1 | {'model': 'Sahajtomar/German-semantic', 'down_scale_algorithm': None, 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | INFO:main:Reading config (clustering) ...
concept-graphs-api-1 | [2024-07-17 11:42:24,138] INFO in main_methods: Reading config (clustering) ...
concept-graphs-api-1 | INFO:main:No config file provided; using default values
concept-graphs-api-1 | [2024-07-17 11:42:24,138] INFO in clustering_util: No config file provided; using default values
concept-graphs-api-1 | INFO:main:Parsed the following arguments for <clustering_util.ClusteringUtil object at 0x7f23790c3310>:
concept-graphs-api-1 | {'algorithm': 'kmeans', 'downscale': 'umap', 'scaling_n_neighbors': 10, 'scaling_min_dist': 0.1, 'scaling_n_components': 100, 'scaling_metric': 'euclidean', 'scaling_random_state': 42, 'kelbow_k': (10, 100), 'kelbow_show': False, 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | [2024-07-17 11:42:24,138] INFO in main_methods: Parsed the following arguments for <clustering_util.ClusteringUtil object at 0x7f23790c3310>:
concept-graphs-api-1 | {'algorithm': 'kmeans', 'downscale': 'umap', 'scaling_n_neighbors': 10, 'scaling_min_dist': 0.1, 'scaling_n_components': 100, 'scaling_metric': 'euclidean', 'scaling_random_state': 42, 'kelbow_k': (10, 100), 'kelbow_show': False, 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | INFO:main:Reading config (graph) ...
concept-graphs-api-1 | [2024-07-17 11:42:24,139] INFO in main_methods: Reading config (graph) ...
concept-graphs-api-1 | INFO:main:No config file provided; using default values
concept-graphs-api-1 | [2024-07-17 11:42:24,140] INFO in graph_creation_util: No config file provided; using default values
concept-graphs-api-1 | INFO:main:Parsed the following arguments for <graph_creation_util.GraphCreationUtil object at 0x7f23790c3cd0>:
concept-graphs-api-1 | {'cluster_distance': 0.7, 'cluster_min_size': 4, 'graph_cosine_weight': 0.6, 'graph_merge_threshold': 0.95, 'graph_weight_cut_off': 0.5, 'graph_unroll': False, 'graph_simplify': 0.5, 'graph_simplify_alg': 'significance', 'graph_sub_clustering': False, 'restrict_to_cluster': True, 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | [2024-07-17 11:42:24,140] INFO in main_methods: Parsed the following arguments for <graph_creation_util.GraphCreationUtil object at 0x7f23790c3cd0>:
concept-graphs-api-1 | {'cluster_distance': 0.7, 'cluster_min_size': 4, 'graph_cosine_weight': 0.6, 'graph_merge_threshold': 0.95, 'graph_weight_cut_off': 0.5, 'graph_unroll': False, 'graph_simplify': 0.5, 'graph_simplify_alg': 'significance', 'graph_sub_clustering': False, 'restrict_to_cluster': True, 'corpus_name': 'test_data_source_3'}
concept-graphs-api-1 | /usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
concept-graphs-api-1 | _torch_pytree._register_pytree_node(
concept-graphs-api-1 | WARNING:root:There are trigger types in the termset that are not expected by negspacy and won't be processed: {'none', 'preceding_speculation', 'following_speculation'}
concept-graphs-api-1 | [2024-07-17 11:42:27,426] WARNING in negation: There are trigger types in the termset that are not expected by negspacy and won't be processed: {'none', 'preceding_speculation', 'following_speculation'}
100%|██████████| 2938/2938 [04:43<00:00, 10.38it/s]
concept-graphs-api-1 | INFO:root:Creating Sentence Embedding with 'None'
concept-graphs-api-1 | [2024-07-17 11:47:15,844] INFO in embedding_functions: Creating Sentence Embedding with 'None'
concept-graphs-api-1 | INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: Sahajtomar/German-semantic
concept-graphs-api-1 | [2024-07-17 11:47:15,844] INFO in SentenceTransformer: Load pretrained SentenceTransformer: Sahajtomar/German-semantic
concept-graphs-api-1 | INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
concept-graphs-api-1 | [2024-07-17 11:47:54,135] INFO in SentenceTransformer: Use pytorch device_name: cpu
concept-graphs-api-1 | Saved under: /rest_api/tmp/test_data_source_3/test_data_source_3_data.pickle
Batches: 100%|██████████| 187/187 [02:50<00:00, 1.10it/s]
concept-graphs-api-1 | INFO:root:Building Concept Cluster ...
concept-graphs-api-1 | [2024-07-17 11:50:54,372] INFO in cluster_functions: Building Concept Cluster ...
concept-graphs-api-1 | INFO:root:UMAP arguments: {'a': None, 'angular_rp_forest': False, 'b': None, 'dens_frac': 0.3, 'dens_lambda': 2.0, 'dens_var_shift': 0.1, 'densmap': False, 'disconnection_distance': None, 'force_approximation_algorithm': False, 'init': 'spectral', 'learning_rate': 1.0, 'local_connectivity': 1.0, 'low_memory': True, 'metric': 'euclidean', 'metric_kwds': None, 'min_dist': 0.1, 'n_components': 100, 'n_epochs': None, 'n_jobs': -1, 'n_neighbors': 10, 'negative_sample_rate': 5, 'output_dens': False, 'output_metric': 'euclidean', 'output_metric_kwds': None, 'precomputed_knn': (None, None, None), 'random_state': 42, 'repulsion_strength': 1.0, 'set_op_mix_ratio': 1.0, 'spread': 1.0, 'target_metric': 'categorical', 'target_metric_kwds': None, 'target_n_neighbors': -1, 'target_weight': 0.5, 'tqdm_kwds': None, 'transform_mode': 'embedding', 'transform_queue_size': 4.0, 'transform_seed': 42, 'unique': False, 'verbose': False}
concept-graphs-api-1 | [2024-07-17 11:50:54,386] INFO in cluster_functions: UMAP arguments: {'a': None, 'angular_rp_forest': False, 'b': None, 'dens_frac': 0.3, 'dens_lambda': 2.0, 'dens_var_shift': 0.1, 'densmap': False, 'disconnection_distance': None, 'force_approximation_algorithm': False, 'init': 'spectral', 'learning_rate': 1.0, 'local_connectivity': 1.0, 'low_memory': True, 'metric': 'euclidean', 'metric_kwds': None, 'min_dist': 0.1, 'n_components': 100, 'n_epochs': None, 'n_jobs': -1, 'n_neighbors': 10, 'negative_sample_rate': 5, 'output_dens': False, 'output_metric': 'euclidean', 'output_metric_kwds': None, 'precomputed_knn': (None, None, None), 'random_state': 42, 'repulsion_strength': 1.0, 'set_op_mix_ratio': 1.0, 'spread': 1.0, 'target_metric': 'categorical', 'target_metric_kwds': None, 'target_n_neighbors': -1, 'target_weight': 0.5, 'tqdm_kwds': None, 'transform_mode': 'embedding', 'transform_queue_size': 4.0, 'transform_seed': 42, 'unique': False, 'verbose': False}
concept-graphs-api-1 | [2024-07-17 11:51:34,328] INFO in cluster_functions: -- Calculating K-Elbow ...
concept-graphs-api-1 | INFO:root:---- shape of embeddings: ((5965, 100))
concept-graphs-api-1 | [2024-07-17 11:51:34,328] INFO in cluster_functions: ---- shape of embeddings: ((5965, 100))
concept-graphs-api-1 | INFO:root:---- Arguments: {'k': (10, 100), 'show': False}
concept-graphs-api-1 | [2024-07-17 11:51:34,329] INFO in cluster_functions: ---- Arguments: {'k': (10, 100), 'show': False}
concept-graphs-api-1 | INFO:root:-- Clustering ...
concept-graphs-api-1 | [2024-07-17 11:51:56,398] INFO in cluster_functions: -- Clustering ...
concept-graphs-api-1 | INFO:root: (kmeans) with Arguments: {}
concept-graphs-api-1 | Number of Clusters: 15
concept-graphs-api-1 | [2024-07-17 11:51:56,398] INFO in cluster_functions: (kmeans) with Arguments: {}
concept-graphs-api-1 | Number of Clusters: 15
concept-graphs-api-1 | /usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py:543: UserWarning: The parameter 'stop_words' will not be used since 'analyzer' != 'word'
concept-graphs-api-1 | warnings.warn(
concept-graphs-api-1 | INFO:root:Building Document Concept Matrix with following arguments:
concept-graphs-api-1 | {'cluster_distance': 0.7, 'cluster_min_size': 4, 'cluster_exclusion_ids': None, 'graph_unroll': False, 'graph_simplify': 0.5, 'graph_simplify_alg': 'significance', 'graph_sub_clustering': False, 'graph_distance_cutoff': 0.5, 'connection_distance': 2, 'restrict_to_cluster': True, 'filter_min_df': 1, 'filter_max_df': 1.0, 'filter_stop': [], 'break_after_graph_creation': True, 'graph_cosine_weight': 0.6, 'graph_merge_threshold': 0.95, 'graph_weight_cut_off': 0.5, 'self': <cluster_functions.WordEmbeddingClustering._ConceptGraphClustering object at 0x7f2346ad4fd0>}
concept-graphs-api-1 | [2024-07-17 11:52:15,464] INFO in cluster_functions: Building Document Concept Matrix with following arguments:
concept-graphs-api-1 | {'cluster_distance': 0.7, 'cluster_min_size': 4, 'cluster_exclusion_ids': None, 'graph_unroll': False, 'graph_simplify': 0.5, 'graph_simplify_alg': 'significance', 'graph_sub_clustering': False, 'graph_distance_cutoff': 0.5, 'connection_distance': 2, 'restrict_to_cluster': True, 'filter_min_df': 1, 'filter_max_df': 1.0, 'filter_stop': [], 'break_after_graph_creation': True, 'graph_cosine_weight': 0.6, 'graph_merge_threshold': 0.95, 'graph_weight_cut_off': 0.5, 'self': <cluster_functions.WordEmbeddingClustering._ConceptGraphClustering object at 0x7f2346ad4fd0>}
concept-graphs-api-1 | INFO:root:Building Concept Graphs... (exclusion_ids: [])
concept-graphs-api-1 | [2024-07-17 11:52:15,465] INFO in cluster_functions: Building Concept Graphs... (exclusion_ids: [])
concept-graphs-api-1 | INFO:root:Filtering phrases
concept-graphs-api-1 | [2024-07-17 11:52:15,465] INFO in cluster_functions: Filtering phrases
concept-graphs-api-1 | Saved under: /rest_api/tmp/test_data_source_3/test_data_source_3_embedding.pickle
concept-graphs-api-1 | Saved under: /rest_api/tmp/test_data_source_3/test_data_source_3_clustering.pickle
53%|█████▎ | 8/15 [00:00<00:00, 23.32it/s]
concept-graphs-api-1 | INFO:root:Cutting edges (significance)...
concept-graphs-api-1 | [2024-07-17 11:52:17,016] INFO in cluster_functions: Cutting edges (significance)...
100%|██████████| 8/8 [00:00<00:00, 12.48it/s]
I don't really care about whether it succeeds or fails. It just doesn't seem right if the pipeline is declared as "successful" in the first response, but actually failed or is still running.
It looks like the concept graph Docker container is eating up my memory and is dying.