@zhangce @SenWu How could this be? This performance bug should have been fixed months ago.
Perhaps it has to do with the volume /lfs/local/0 being 99% full. We may have to ask people again to free up space.
How many CPU cores are busy if you run top?
Can you let me know which machine this is, or a connection string to the DB?
Ce
The second phase of sequence assignment just finished, and I want to let it continue (raiders4, DB is memex_mar2015_large). I'll let you know the next time I'm running sequence assignment.
@raphaelhoffmann Thanks! Could you share the log of your run with us?
Sure! The log of the run is here:
/lfs/local/1/raphaelh/dd/new/deepdive/log/2015-09-06T093453.txt
Database is memex_mar2015_large and it's running inference only. I launched this job at 9:30AM and it still hasn't reached the sampling phase.
Thanks!
Correct me if I'm wrong -- it looks like all those post-hoc-assigned sequence IDs are only ever used as array indexes in the sampler and never for SQL joins. If that's true, what are the challenges in doing without the sequence assignment steps?
Feng
Or if the ID fields are used as conceptual foreign keys in the DB, maybe we could use a fingerprint of the tuple.
The fingerprint of tuples in GP is a little weird (it is too physical, in the sense that there is no guarantee about when it will change)... We need to do more profiling after Raphael's current run. That function should be faster than 2 MB/s if the DB is working properly.
@raphaelhoffmann From the log file, there is nothing wrong... Please let me know if you have further information. @alldefector We do use the assigned variable IDs in SQL joins to ground edges; you can find it here: https://github.com/HazyResearch/deepdive/blob/1573cb703079b10cb2dcc93b3d67e09d079d628c/src/main/scala/org/deepdive/inference/SQLInferenceRunner.scala#L718.
Ce: The job is no longer running. It may have to do with the disk being too full.
Oh, I meant doing fingerprinting in our code over the tuple content as the tuple is generated. Basically a hash function of the tuple.
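A minimal sketch of what such a content fingerprint could look like (a hypothetical helper, not existing DeepDive code; it assumes the tuple's column values are available in a fixed order and that the collision risk of a truncated hash is acceptable):

```python
import hashlib

def tuple_fingerprint(fields):
    """Derive a stable 63-bit ID from the tuple's content as it is generated."""
    # Join column values with a delimiter unlikely to appear in the data.
    payload = "\x01".join(str(f) for f in fields).encode("utf-8")
    digest = hashlib.sha1(payload).digest()
    # Mask off the top bit so the ID fits in a signed BIGINT column.
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF
```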
Thanks @SenWu. If we assign non-contiguous IDs or fingerprints to the tuples (say in Java or bash) right before each tuple is inserted into the DB, those joins should still work, right?
Is it possible to tweak the sampler so that we can feed it variables and factors without sequential IDs?
@alldefector If we assign non-contiguous IDs, then we need to remap them to contiguous IDs in the sampler.
Is it possible to tweak the sampler so that we can feed it variables and factors without sequential IDs? Yes, we can do this, but there are a few things we would need to take care of.
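For illustration only (the sampler itself is not written in Python), such a remapping is essentially a dictionary built while loading the factor graph:

```python
def remap_ids(variable_ids):
    """Map each distinct external (non-contiguous) ID to a dense index 0..n-1."""
    index_of = {}
    for vid in variable_ids:
        if vid not in index_of:
            index_of[vid] = len(index_of)
    return index_of

# Factors would then reference variables via the dense indexes, e.g.:
#   dense_vars = [index_of[vid] for vid in factor_variable_ids]
```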
@raphaelhoffmann I profiled the query `SELECT fast_seqassign('dd_query_rates_features', 0);`
which took 2 hours in your log. I think it is a DB problem rather than a problem with how we implement this function.
The key check for whether the bottleneck is really the fast_seqassign function is the ratio of execution times between
`create table tmp as select * from dd_query_rates_features;`
and
`SELECT fast_seqassign('dd_query_rates_features', 0);`
If the ratio deviates significantly from 1, then the bottleneck is fast_seqassign, which I think is not the case here (I did not let the CREATE TABLE query finish; it was not making much progress after 1 hour)...
I think for this GitHub issue, we need to restart the DB, try VACUUM, etc.
@alldefector The game in optimizing this query is to try to eliminate the extra pass of table writing caused by UPDATE (because PostgreSQL does not have in-place updates)... We have no global sync for contiguous IDs, so we might not see much improvement just by switching to non-contiguous IDs...
@chrismre asked yesterday why we do not use generate_sequence (I forgot why I did not use it yesterday... now I remember :)). The problem is that GP's optimizer (at least in 2013) does not know that tuples with the same gp_segment_id are on the same segment, so I cannot get a sequential-scan plan with generate_sequence... I do agree that if we rewrite that Python function in C++ we might see some improvement, but we need to profile it first (by reporting the runtime ratio between the two queries above).
Ce
@zhangce Yeah, getting rid of the table rewrite (UPDATE) is exactly what I had in mind. If sequentialness is absolutely necessary but generating the IDs inside the DB is costly, can we design a way to generate them from the extractor driver code? If all extraction is done in one process, we could just use a simple integer variable. If there are multiple extraction processes, maybe we could use some sort of global sequence ID service to coordinate between them. One example of such a service would be a synchronized piece of shared memory (say /dev/shm) storing the next available ID. Each process reserves IDs in chunks -- "I'm taking the range 1-1000, increment the global counter to 1001". That way there might be gaps in the final list of IDs, but the total amount of gaps is limited to chunk_size × the number of processes.
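A minimal sketch of what such a chunked reservation service could look like, assuming a counter file on /dev/shm guarded by an advisory lock (all names and paths here are hypothetical, not existing DeepDive code):

```python
import fcntl
import os

def reserve_id_block(chunk, counter_path="/dev/shm/dd_next_id"):
    """Atomically reserve the half-open ID range [start, start + chunk)."""
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # serialize concurrent extractor processes
        raw = os.read(fd, 32)
        start = int(raw) if raw else 0
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, str(start + chunk).encode())
        return start, start + chunk
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Each extractor process calls reserve_id_block(1000) and hands out IDs from its
# private range; gaps are bounded by chunk_size times the number of processes.
```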
@alldefector Your recommendation makes perfect sense for one fast_seqassign call (there are two such calls that are expensive):
1. There is one `fast_seqassign` call for variables loaded from the output of extractors.
2. There is another `fast_seqassign` call for factors output by SQL grounding queries.
For 1, we could just have a really fast shell script that assigns IDs in the TSV before loading (just like your global sequence ID service idea).
For 2, we don't really need IDs for factors, so that part can probably be eliminated completely.
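The comment above suggests a shell script; for concreteness, here is a minimal equivalent sketch in Python (the starting offset and the pipe into COPY are assumptions) that prepends a contiguous ID column to a TSV stream before loading:

```python
import sys

def assign_ids(in_stream, out_stream, start=0):
    """Prepend a contiguous ID column to every TSV row before loading."""
    next_id = start
    for line in in_stream:
        out_stream.write(f"{next_id}\t{line}")
        next_id += 1
    return next_id  # use as the start offset for the next file or process

if __name__ == "__main__":
    # e.g. extractor output on stdin, piped straight into COPY ... FROM STDIN
    offset = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    assign_ids(sys.stdin, sys.stdout, start=offset)
```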
Awesome! Who is going to own those changes?
We restarted greenplum (thanks, @SenWu!), freed up some space, and re-ran this process. The log of the new run is here:
/lfs/local/1/raphaelh/dd/new/deepdive/log/2015-09-06T093453.txt
We are still seeing the same runtimes (3h for sequence assignment of rates, 2 or 3h for sequence assignment of features). I haven't yet run the CREATE TABLE command Ce suggested, but I suspect that the runtime is indeed normal.
@zhangce: I think the changes you propose sound great! Any chance you can make a first stab at it?
Thanks!
Did you VACUUM?
Ce
I ran VACUUM last night and then restarted the job. The time for sequence assignment on "rates" went down from 3h to 2:18h. I haven't checked the time for sequence assignment on the feature table yet.
So it's a small improvement, but sequence assignment is still a major bottleneck.
Looks like we may want to revisit the design of the internal tables. For example, https://github.com/HazyResearch/deepdive/blob/master/src/main/scala/org/deepdive/inference/InferenceNamespace.scala#L27-L28
We have these two prefixes: `dd_query_${tableName}` and `dd_factors_${tableName}`.
`dd_query_` is presumably for variables, but it's actually used for factors. Variable tables are created by users (i.e., no prefix). `dd_factors_` is not used anywhere.
I suppose at some point somebody planned to copy user-created tables (e.g., genes) to internal variable tables (e.g., dd_query_genes), and if that's the case, we could piggyback ID assignment on the copy. Not sure if any previous version of DD ever did such copying.
If there is a strong reason not to have internal variable tables, why don't we just add an auto-increment ID column (with a shared sequence) to each user-created/defined variable table? (I'd ditch the attempted support for non-pg-family DBs, namely yourSQL.)
Would the shared sequence be a bottleneck for parallel DBs?
Assigning the IDs when writing the table for the first time seems like the right thing to do. But I thought the reason we have stored procedures for sequence assignment is that we found the sequence generators were too slow (?)
If the sequence generator turns out to be a bottleneck, it's likely because it's syncing after every row. In that case we could create a stored procedure that does the insert in batches...
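A rough client-side sketch of that idea (using psycopg2 rather than an actual stored procedure; the table, column, and sequence names are made up): reserve a whole block of IDs with one sequence round trip per batch instead of syncing after every row.

```python
import psycopg2
from psycopg2.extras import execute_values

def insert_with_batched_ids(conn, rows, batch_size=10000):
    """Insert rows with explicit IDs, touching the sequence once per batch."""
    with conn.cursor() as cur:
        for i in range(0, len(rows), batch_size):
            chunk = rows[i:i + batch_size]
            # One round trip reserves IDs for the whole chunk.
            cur.execute(
                "SELECT nextval('dd_variable_id_seq') FROM generate_series(1, %s)",
                (len(chunk),))
            ids = [r[0] for r in cur.fetchall()]
            execute_values(
                cur,
                "INSERT INTO dd_query_rates (id, payload) VALUES %s",
                list(zip(ids, chunk)))
    conn.commit()
```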
@alldefector @raphaelhoffmann Sorry, I am confused by this... The 2:18h number is from raiders4, which you just said has 4x lower I/O bandwidth than raiders3?... Maybe we should do proper profiling before we jump into discussing all those details? Let's fix the machine for you first, and then restart our discussion?
@zhangce. Sounds good, we can re-run the experiment once raiders4 is working properly again.
Here are a few more measured runtimes for another extractor on an AWS r3.8xlarge instance, using a RAID over the local SSD volumes for I/O. I/O throughput is very good on that machine (600-700MB/sec).
3:00h running extractor
1:40h SELECT copy_table_assign_ids_replace('public', 'dd_query_f_isoutcall_features', 'id', 0)
3:00h COPY (SELECT DISTINCT t0.id AS factor_id, t1.id AS weight_id, "isoutcall.id"
FROM dd_query_f_isoutcall_features t0, dd_weights_f_isoutcall_features t1
WHERE t0."isoutcall.feature" = t1."isoutcall.feature") TO STDOUT;
1:45h CREATE UNLOGGED TABLE dd_isoutcall_vtype AS SELECT t0.id, CASE WHEN t2.variable_id IS NOT NULL AND is_outcall IS NOT NULL THEN 2
WHEN t1.variable_id IS NOT NULL THEN 0
WHEN is_outcall IS NOT NULL THEN 1
ELSE 0
END as __dd_variable_type__
FROM isoutcall t0 LEFT OUTER JOIN dd_graph_variables_holdout t1 ON t0.id=t1.variable_id
LEFT OUTER JOIN dd_graph_variables_observation t2 ON t0.id=t2.variable_id
0:45h Unloading factor graph to disk (tobinary)
0:30h Sampler loading factor graph until crash due to out of memory
I think a larger bottleneck seems to be the third query, not the sequence-assignment one? But in the next version of the sampler we will get rid of this query soon...
Ce
I see. @SenWu has also created a pull request that gets rid of the DISTINCT in the third query. That simplifies the query plan by removing a sort, speeding this query up.
@zhangce I'm done with my processing work, so this might be a good time to investigate the issues on raiders4. Both machines seem to have the same configuration, with two NAS volumes:
/lfs/local/0 (~15TB)
/lfs/local/1 (~32TB)
I measured sequential write I/O (using `dd if=/dev/zero of=/lfs/local/1/raphaelh/xxx bs=64k count=131072 conv=fsync`) and found that the speed for /lfs/local/0 is about 100 MB/sec and for /lfs/local/1 about 150 MB/sec. These numbers are the same for raiders3 and raiders4. Since /lfs/local/1 is much larger, and thus likely contains more disks, I expect a larger difference for random I/O.
On raiders3, greenplum uses /lfs/local/1 for storage; on raiders4, it uses /lfs/local/0. That could explain some of the difference in performance. Also, the volume is still 96% full on raiders4 (it used to be 99%) and there might be fragmentation.
So, I think the TODOs to fix raiders4 are:
@SenWu Would it be possible to move greenplum over?
@raphaelhoffmann Let's move this discussion off the public issue to avoid broadcasting to everyone (it is getting too detailed for most subscribers)?
The 100 MB/s vs. 150 MB/s throughput does not explain the 4x difference you observed? Should we Skype quickly? Sorry, some numbers do not quite add up and I am still confused about what the issue is...
Thanks!
Ce
Since this thread discusses multiple topics, we decided to close this issue and open separate new issues. We will also continue our investigation of hardware issues offline.
On greenplum (raiders4), the following query is executed while running our rates extractor:
SELECT fast_seqassign('rates', 0)
Throughput is 2.4 MB/sec.
This means it takes 3h just to assign IDs to the candidates, and another 3h+ (still running) to assign IDs to the features.
Any ideas on how we could speed up sequence assignment?