ALIADA / aliada-tool

Aliada tool implementation
GNU General Public License v3.0
35 stars 14 forks

RDFizer benchmark #103

Closed agazzarini closed 9 years ago

agazzarini commented 9 years ago

I'm opening this issue following the email sent by @adampogany, just to track information about benchmarks and performance.

I believe this is not strictly part of the 2nd prototype (which I think should focus on full functional coverage), so I won't assign a milestone; we will leave this issue open just to accumulate tests, comments, feedback and, possibly, fixes.

However, namespace support is still missing, so the XPath expressions for LIDO and DC are not working. That is the goal of the next commit.

agazzarini commented 9 years ago

On a first run, the RDFizer seems to manage its own resources well. I think the main drawback is the wide XPath usage. Anyway, as you can see the system is basically stable: memory (I explicitly set Xmx to less than 1 GB) and the garbage collector follow a regular cycle.

[screenshot: memory / GC profile]

Also (second screenshot) the stats seem quite stable. I'd actually like to speed the process up a lot, but remember that the bottleneck here is definitely the remote SPARQL endpoint: the system is doing a lot of (blocking) I/O, so the "triples-channel", the asynchronous channel in charge of sending the produced triples, is always full and busy.

[screenshot: job stats]

xmolero commented 9 years ago

Don't you have problems with Java heap memory or used memory, @agazzarini?

When I ran some conversions, the Aliada server's used memory kept increasing, and in the end one conversion slowed down and didn't finish.

[screenshot: 30_04_2015 9_53_53]

agazzarini commented 9 years ago

No. What are your VM parameters (Xms, Xmx, etc.)? I'm running with (about) 800 MB of heap and I don't have any problem at all. Performance is another matter, but the process is stable.

xmolero commented 9 years ago

These are the VM parameters: JAVA_OPTS="-d64 -Xms2000m -Xmx4000m -log4j.configuration=file:///usr/share/tomcat/conf/log4j.xml"

agazzarini commented 9 years ago

How many records are you trying to translate? If the file is not too big, could you please send it to me?

xmolero commented 9 years ago

10,000 records, but we have 9 more files like this one to convert. I'll send you an email with the file. Thanks a lot!

agazzarini commented 9 years ago

I added a cache mode in the templating engine settings, which was the actual concrete problem. However, as far as I understand, we have the following bottlenecks:

a) Proper pipeline configuration. It's not possible to dynamically determine the number of consumers (i.e. threads) listening on a given channel, so the current setting could be too small or too large.

b) NER. I already saw this during the implementation. Named Entity Recognition slows down the process dramatically; in addition, it requires a lot of memory to load the classifier model.

c) XPath. The wide usage of XPath is another thing that consumes a lot of resources.

d) Blocking I/O. As explained above, there's a lot of outbound I/O, which is blocking. Each thread, while talking to the RDF store, is blocked, and therefore the queue keeps growing. At the moment there's a "blockWhenFull" parameter, which prevents OOM but definitely slows down the overall process (a minimal sketch of this back-pressure behaviour is at the end of this comment).

On top of that, after forcing the cache as mentioned above, I'm able to process the 10,000 records. I don't think, at this point, that file size matters.
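
For reference, here is a minimal sketch of the back-pressure behaviour described in a) and d). It is not the actual ALIADA channel implementation: it just uses a plain java.util.concurrent.ArrayBlockingQueue, whose put() blocks when the buffer is full, and a configurable number of consumer threads doing blocking I/O; all names and sizes are made up for illustration.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: a bounded "triples channel" with blockWhenFull-like
// behaviour. Names, sizes and the SPARQL call are placeholders; the real RDFizer
// pipeline configuration differs.
public class TriplesChannelSketch {

    private final BlockingQueue<String> channel;
    private final ExecutorService consumers;
    private final int consumerThreads;

    public TriplesChannelSketch(int capacity, int consumerThreads) {
        this.channel = new ArrayBlockingQueue<>(capacity);
        this.consumerThreads = consumerThreads;
        this.consumers = Executors.newFixedThreadPool(consumerThreads);
    }

    // The number of consumers is fixed in configuration (bottleneck a): too few and
    // the queue fills up, too many and each one just blocks on outbound I/O.
    public void start() {
        for (int i = 0; i < consumerThreads; i++) {
            consumers.submit(this::consumeLoop);
        }
    }

    // Producer side: put() blocks when the buffer is full (bottleneck d), which
    // prevents OOM but slows the whole job down.
    public void publish(String triple) throws InterruptedException {
        channel.put(triple);
    }

    // Consumer side: every take() is followed by a blocking call to the remote
    // SPARQL endpoint, so the channel tends to stay full and busy.
    private void consumeLoop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                sendToSparqlEndpoint(channel.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void sendToSparqlEndpoint(String triple) {
        // placeholder for the blocking HTTP / SPARQL update call
    }
}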

agazzarini commented 9 years ago

I have introduced a simplified and optimized version of XPath: besides the ordinary XPath class there's now a sibling OXPath. The performance is very good, orders of magnitude better than the previous version. I was able to process 10,000 records in about 1 minute 30 seconds (i3 with 8 CPUs and Xms3000m / Xmx3000m).
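
To give an idea of the kind of optimisation involved, here is a minimal sketch, using plain javax.xml.xpath and not the actual OXPath code, of compiling each expression once per thread and reusing it instead of recompiling it for every record; the class and its caching policy are assumptions for illustration only.

import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import java.util.HashMap;
import java.util.Map;

// Sketch only: reuse compiled XPath expressions instead of recompiling them
// for every record. XPath objects are not thread-safe, so the cache is per thread.
public final class PerThreadXPathCache {

    private static final ThreadLocal<Map<String, XPathExpression>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    public static XPathExpression expression(String xpath) {
        return CACHE.get().computeIfAbsent(xpath, expr -> {
            try {
                // Compile once per thread; subsequent lookups hit the cache.
                return XPathFactory.newInstance().newXPath().compile(expr);
            } catch (XPathExpressionException exception) {
                throw new IllegalArgumentException("Invalid XPath: " + expr, exception);
            }
        });
    }

    private PerThreadXPathCache() {
        // static utility, no instances
    }
}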

agazzarini commented 9 years ago

However, namespace support (for LIDO and DC) is still missing, so I'm leaving this issue open.

agazzarini commented 9 years ago

Added support for namespaces. The issue is closed as the overall throughput is now definitely good. As we briefly discussed in the last call, I suggest feeding 1 job with 1 file containing a lot of records.
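
For the record, this is roughly how prefix-to-URI bindings are wired into javax.xml.xpath so that expressions like //lido:lido or //dc:title resolve. The prefixes, URIs and wiring below are illustrative assumptions, not the exact ALIADA code.

import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of a namespace context for XPath evaluation; the actual bindings used
// by ALIADA may differ.
public class SimpleNamespaceContext implements NamespaceContext {

    private final Map<String, String> prefixToUri = new HashMap<>();

    public SimpleNamespaceContext() {
        // Assumed bindings for illustration.
        prefixToUri.put("lido", "http://www.lido-schema.org");
        prefixToUri.put("dc", "http://purl.org/dc/elements/1.1/");
    }

    @Override
    public String getNamespaceURI(String prefix) {
        return prefixToUri.get(prefix);
    }

    @Override
    public String getPrefix(String namespaceURI) {
        // Reverse lookup, rarely needed for plain XPath evaluation.
        return prefixToUri.entrySet().stream()
                .filter(entry -> entry.getValue().equals(namespaceURI))
                .map(Map.Entry::getKey)
                .findFirst().orElse(null);
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
    }

    public static void main(String[] args) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new SimpleNamespaceContext());
        // With the context set, prefixed expressions compile and evaluate.
        System.out.println(xpath.compile("//dc:title") != null);
    }
}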

agazzarini commented 9 years ago
<job>
   <completed>true</completed>
   <start-date>2015-05-06T13:41:45+02:00</start-date>         
   <end-date>2015-05-06T13:42:21+02:00</end-date>
   <format>marcxml</format>
   <id>4321</id>
   <processed-records-count>10021</processed-records-count>
   <records-throughput>96</records-throughput>
   <running>false</running>
   <status-code>0</status-code>
   <total-records-count>10021</total-records-count>
   <output-statements-count>1495914</output-statements-count>
   <triples-throughput>14268</triples-throughput>
</job>
agazzarini commented 9 years ago

Guys, even if the issue is open, I still want to underline the negative impact that the NER process has on resource usage (RAM, especially). I created 2 implementations of the NER service:

  • the first is a singleton which, as we know, needs synchronization. This is bad, but it avoids repeatedly loading the NER classifiers, which are not thread-safe and very large.
  • the second is a thread-local version, which loads 1 classifier for each calling thread. Since the number of threads (i.e. consumers) attached to a given queue is set in configuration, this implementation could be used when that number is small. The advantage is that it doesn't require synchronization; the drawback is that there is a different classifier instance for each calling thread.

I put the first (singleton) as the default in ALIADA because, at the moment, with the new XPath engine, we process (about) 10,000 records in 30 secs (8 CPUs, Xms3000m, Xmx3000m, no SSD). Both variants are sketched below.
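
Minimal sketch of the two wiring strategies; NerClassifier and the loader are hypothetical stand-ins, not the actual ALIADA classes.

import java.util.function.Supplier;

// Hypothetical classifier interface: assume tag() is NOT thread-safe and the
// underlying model is expensive to load.
interface NerClassifier {
    String tag(String text);
}

// Variant 1: one shared classifier, guarded by synchronization.
// Low memory footprint, but calls are serialized.
final class SingletonNerService {
    private final NerClassifier classifier;

    SingletonNerService(Supplier<NerClassifier> loader) {
        this.classifier = loader.get(); // model loaded once
    }

    synchronized String recognise(String text) {
        return classifier.tag(text);
    }
}

// Variant 2: one classifier per consumer thread.
// No synchronization, but memory grows with the number of threads.
final class ThreadLocalNerService {
    private final ThreadLocal<NerClassifier> classifier;

    ThreadLocalNerService(Supplier<NerClassifier> loader) {
        this.classifier = ThreadLocal.withInitial(loader::get);
    }

    String recognise(String text) {
        return classifier.get().tag(text);
    }
}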

cgareta commented 9 years ago

Andrea, we will discuss this issue next week in Budapest

Thank you

Cristina

agazzarini commented 9 years ago

Hi Cristina, the issue has been fixed: I completely rewrote the XPath engine and I now process that file (about 10,000 records) in 30 secs. See the issue on GitHub for more details.

Best, Andrea

agazzarini commented 9 years ago

Sorry, re-reading my closing post I see I made a mistake: I meant "Guys, even if the issue is closed"

Best, Andrea

cgareta commented 9 years ago

Great!!
