agazzarini closed this issue 9 years ago.
At a first try, the RDFizer seems to manage its own resources well. I think the main drawback can be found in the wide XPath usage. Anyway, as you can see the system is almost stable, as the memory (I explicitly set less than 1GB of Xmx) and the garbage collector have a regular cycle.
Also (second screenshot) the stats seem to be quite stable. Actually I'd like to speed up the process a lot, but remember that here the bottleneck is for sure the remote SPARQL endpoint: the system is doing a lot of (blocking) I/O, so basically the "triples-channel", the asynchronous channel which is in charge of asynchronously sending the produced triples, is always full and busy.
You don't have problems with Java heap memory or used memory, @agazzarini?
When I did some conversions, the Aliada server's memory usage grew and grew, and finally one conversion went slowly and didn't finish.
No. What are the VM parameters (Xms, Xmx, etc.)? I'm running with (about) 800MB of heap and no, I don't have any problem at all. Performance is another matter, but the process is stable.
These are the VM parameters: JAVA_OPTS="-d64 -Xms2000m -Xmx4000m -log4j.configuration=file:///usr/share/tomcat/conf/log4j.xml"
How many records are you trying to translate? If the file is not too big, could you please send it to me?
10000 records, but we have 9 more files like this to convert. I'll send you an email with the file. Thanks a lot!
I added a cache mode in the templating engine settings, which was the actual, concrete problem. However, as far as I understand, we have the following bottlenecks:
a) Proper pipeline configuration. It's not possible to dynamically determine the number of consumers (i.e. threads) listening on a given channel, so the current setting could be too small or too large.
b) NER. I already noticed this during implementation: Named Entity Recognition slows down the process dramatically. In addition, it requires a lot of memory for loading the classifier model.
c) XPath. The wide usage of XPath is another thing that consumes a lot of resources.
d) Blocking I/O. As explained above, there's a lot of outbound I/O, which is blocking. Each thread, while talking to the RDF store, is blocked, and therefore the queue grows. At the moment there's a "blockWhenFull" parameter, which prevents OOM but definitely slows down the overall process.
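The "blockWhenFull" behaviour in (d) can be pictured as a bounded blocking queue. This is a hypothetical simplification, not the actual ALIADA code: producers block on put() when the consumer side (the thread talking to the remote SPARQL endpoint) cannot keep up, which keeps memory bounded at the cost of throughput.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the "triples-channel" with blockWhenFull semantics:
// a bounded queue whose put() blocks while the queue is full.
public class TriplesChannel {
    private final BlockingQueue<String> channel;

    public TriplesChannel(int capacity) {
        this.channel = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: blocks while the channel is full, preventing OOM.
    public void publish(String triple) {
        try {
            channel.put(triple);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Consumer side: blocks while the channel is empty.
    public String take() {
        try {
            return channel.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public int size() {
        return channel.size();
    }
}
```

The alternative to blocking would be an unbounded queue, which is exactly what would let memory grow until an OutOfMemoryError when the endpoint is slow.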
On top of that, after forcing the cache as mentioned above, I'm able to process the 10000 records. I don't think, at this point, that file size matters.
I have introduced a simplified and optimized version of XPath. So, besides the ordinary XPath class, there's a sibling OXPath. Performance is very good, orders of magnitude better than the previous version. I was able to process 10,000 records in about 1:30 minutes (i3 with 8 CPUs and Xms3000m / Xmx3000m).
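For illustration only (this is not the actual OXPath code): one classic reason plain XPath usage is slow is re-compiling the same expression for every record, and compiling each expression once and caching the resulting XPathExpression removes that cost. Note that javax.xml.xpath objects are not thread-safe, so a real pipeline would keep one cache (or one XPath) per consumer thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

// Illustrative sketch of the kind of optimization that pays off with heavy
// XPath usage: compile each expression at most once and reuse it.
public class CompiledXPathCache {
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private final Map<String, XPathExpression> cache = new ConcurrentHashMap<>();

    // Returns the compiled form of the given expression, compiling it
    // at most once per distinct expression string.
    public XPathExpression expression(String expr) {
        return cache.computeIfAbsent(expr, e -> {
            try {
                return xpath.compile(e);
            } catch (XPathExpressionException ex) {
                throw new IllegalArgumentException("Invalid XPath: " + e, ex);
            }
        });
    }

    public int size() {
        return cache.size();
    }
}
```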
However, namespace support (for LIDO and DC) is still missing, so I'm leaving this issue open.
Added support for namespaces. The issue is closed, as the overall throughput is now definitely good. As we briefly discussed in the last call, I suggest feeding 1 job with 1 file containing a lot of records.
<job>
<completed>true</completed>
<start-date>2015-05-06T13:41:45+02:00</start-date>
<end-date>2015-05-06T13:42:21+02:00</end-date>
<format>marcxml</format>
<id>4321</id>
<processed-records-count>10021</processed-records-count>
<records-throughput>96</records-throughput>
<running>false</running>
<status-code>0</status-code>
<total-records-count>10021</total-records-count>
<output-statements-count>1495914</output-statements-count>
<triples-throughput>14268</triples-throughput>
</job>
Guys, even if the issue is open, I still want to underline the negative impact that the NER process has on resource usage (RAM, especially). I created 2 implementations of the NER service:
- the first is a singleton, which, as we know, needs synchronization. This is bad, but it avoids loading multiple NER classifiers, which are not thread-safe and very large
- the second is a thread-local version, which loads 1 classifier for each calling thread. Since the number of threads (i.e. consumers) attached to a given queue is fixed in configuration, this implementation could be used when that number is small. The advantage is that it doesn't require synchronization; the drawback is that each calling thread holds a different instance of the classifier.
I put the first (singleton) as the default in ALIADA because, at the moment, with the new XPath engine, we process (about) 10,000 records in 30 secs. (8 CPUs, Xms3000m, Xmx3000m, no SSD)
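The second of the two NER implementations mentioned above, the thread-local one, can be sketched as follows. This is a hypothetical illustration (the names and the trivial stand-in classifier are invented, not the real ALIADA code): each consumer thread lazily loads its own classifier instance, so no synchronization is needed, at the cost of one (large) model copy per thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the thread-local NER variant.
public class ThreadLocalNer {

    // Stand-in for the real, expensive, non-thread-safe classifier model.
    static class Classifier {
        String recognize(String text) {
            return "ENTITY:" + text;
        }
    }

    // Counts how many models have been loaded (one per calling thread).
    static final AtomicInteger LOADED_MODELS = new AtomicInteger();

    // Each thread gets its own lazily-created classifier instance, so
    // recognize() never needs synchronization.
    private static final ThreadLocal<Classifier> CLASSIFIER =
            ThreadLocal.withInitial(() -> {
                LOADED_MODELS.incrementAndGet();
                return new Classifier();
            });

    public static String recognize(String text) {
        return CLASSIFIER.get().recognize(text);
    }
}
```

With a small, fixed number of consumer threads, the memory cost stays bounded; with many threads, the singleton plus synchronization trade-off wins, which is why it was chosen as the default.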
Andrea, we will discuss this issue next week in Budapest
Thank you
Cristina
Hi Cristina, the issue has been fixed: I completely rewrote the XPath engine, and it now processes that file (about 10000 records) in 30 secs. See the issue on GitHub for more details.
Best, Andrea
Sorry, re-reading my closing post I see I made a mistake: I meant "Guys, even if the issue is closed"
Best, Andrea
Great!!
I'm opening this issue after the mail sent by @adampogany, just to track information about benchmarks and performance.
I believe this is not strictly part of the 2nd prototype (which I think should have full functional coverage), so I won't indicate a milestone; we will leave this issue open just to accumulate tests, comments, feedback and, possibly, fixes.
However, namespace support is missing, so XPath expressions for LIDO and DC are not working. This is the goal of the next commit.