fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License

Frame.Join() undesirable sort of columns order based on column name #304

Closed: omencat closed this issue 4 years ago

omencat commented 9 years ago

Hi,

I have a List<Frame<DateTimeOffset, string>> and I am outer joining the frames in the list into a single consolidated frame. The frames in the list are all in the correct order, but when I iterate through the list and join each frame into the consolidated frame, the Join method appears to re-sort the column keys alphabetically by column name. This is not the desired result. To test, I prepended a numerical counter to the column name (e.g. 1_frameA), and sure enough the columns are now in the proper order. I didn't find anything in the API guide about column re-sorting.

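        // combinedFrame is a field declared elsewhere; it starts out as an empty frame.
        public void SortFrames(List<string> correctOrder, List<Frame<DateTimeOffset, string>> frames)
        {
            foreach (var c in correctOrder)
            {
                // Pick the frame whose first column key contains the expected name.
                var matchedFrame = frames.Find(frame => frame.ColumnIndex.Keys[0].Contains(c));
                if (matchedFrame == null) continue; // no frame matches this name

                // Outer join: keep the union of row keys from both frames.
                combinedFrame = combinedFrame.Join(matchedFrame, JoinKind.Outer);
            }
        }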

Thoughts?

Thanks.

tpetricek commented 9 years ago

Hmm - I guess what could be happening here is that the columns in the original frames are initially sorted by name (perhaps accidentally, or perhaps there is just one column) and when Deedle merges the columns, it preserves this property - and reorders the columns. Would that explain this?

I think we don't currently have a way to say "treat columns as unordered" (even when they are ordered).

As a workaround, you could change this to use Frame.FromColumns (to create a frame from individual pairs of column key and series), which will preserve the order in which you specify them.
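A minimal, untested sketch of that workaround, assuming the frames are already in the desired order, every column holds double values, and column keys are unique across frames. The class and method names (FrameOrderWorkaround, Combine) are made up for illustration. It relies on the Frame.FromColumns overload that takes a sequence of key/series pairs; I would expect rows to be aligned on the union of row keys, but that is worth checking against the Deedle version in use.

```csharp
using System;
using System.Collections.Generic;
using Deedle;

public static class FrameOrderWorkaround
{
    // Hypothetical helper: builds one frame whose columns follow the order of
    // `frames` and, within each frame, that frame's own column order.
    // Assumption: all columns can be read as Series<DateTimeOffset, double>.
    public static Frame<DateTimeOffset, string> Combine(
        IEnumerable<Frame<DateTimeOffset, string>> frames)
    {
        var columns = new List<KeyValuePair<string, Series<DateTimeOffset, double>>>();

        foreach (var frame in frames)
        {
            foreach (var key in frame.ColumnKeys)
            {
                // Collect (column key, series) pairs in the order we want to keep.
                columns.Add(new KeyValuePair<string, Series<DateTimeOffset, double>>(
                    key, frame.GetColumn<double>(key)));
            }
        }

        // Build the frame in one step instead of joining frame by frame.
        return Frame.FromColumns(columns);
    }
}
```

With this approach the ordering logic from SortFrames only has to decide the order of the input list; no Join call is involved.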

tpetricek commented 9 years ago

I think we could actually consider changing the default here - and treat string keys as unordered unless you explicitly call "sort by keys". I don't imagine people will often want to sort string keys automatically... (unlike with date/datetime or perhaps int keys).

Alternatively, we could disable the "automatic ordering preservation" behavior for column keys and leave it only for row keys (not sure what is better...)

@adamklein @hmansell Do you have thoughts on this?

omencat commented 9 years ago

I will look into your suggestion. To answer your question about the original frame, it is initialized as empty with the Frame.CreateEmpty method. When I iterate over the frames and join them, I am sure they are in the order I want; that is to say, for each join into the consolidated frame, I want the incoming "right" column to end up as the rightmost column. But here is an example of what I am observing...

        Empty frame...
        join column "X2014" // key[0] = "X2014"
        join column "Z2014" // key[0] = "X2014", key[1] = "Z2014"
        join column "F2015" // key[0] = "F2015", key[1] = "X2014", key[2] = "Z2014"

So on the third iteration, column "F2015" jumped to the key[0] location even though I wanted it to be key[2]. Because F comes before X and Z, it looks like the keys are being sorted. If I put numbers in front, they stay in the correct order. But... yuck.

"001_X2014", "002_Z2014", "003_F2015"
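The observed sequence is consistent with a join that keeps the result sorted whenever both inputs are sorted. The sketch below is a toy model of that idea in plain C#, not Deedle's actual implementation: the names (ColumnOrderSketch, MergeKeys, Replay) are hypothetical, ordinal string comparison is an assumption, and column keys are assumed not to overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnOrderSketch
{
    // True when the keys are in ascending ordinal order.
    // An empty or single-element list counts as sorted.
    public static bool IsSorted(IReadOnlyList<string> keys)
    {
        for (int i = 1; i < keys.Count; i++)
        {
            if (string.CompareOrdinal(keys[i - 1], keys[i]) > 0) return false;
        }
        return true;
    }

    // Toy model of an "ordering-preserving" outer join on column keys:
    // if both inputs are sorted, the result is the sorted union;
    // otherwise the right-hand keys are simply appended.
    public static List<string> MergeKeys(IReadOnlyList<string> left, IReadOnlyList<string> right)
    {
        var merged = left.Concat(right).ToList();
        if (IsSorted(left) && IsSorted(right))
        {
            merged.Sort(string.CompareOrdinal);
        }
        return merged;
    }

    // Replays a series of single-column joins, starting from an empty frame.
    public static List<string> Replay(IEnumerable<string> columnsInJoinOrder)
    {
        var keys = new List<string>();
        foreach (var column in columnsInJoinOrder)
        {
            keys = MergeKeys(keys, new[] { column });
        }
        return keys;
    }
}
```

Under this model, Replay(new[] { "X2014", "Z2014", "F2015" }) yields F2015, X2014, Z2014, matching what is described above, while the prefixed names keep their order because they are already sorted in join order.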

adamklein commented 9 years ago

I am in favor of disabling the "automatic ordering preservation" behavior for columns in joins. pandas implements it this way, and I think this choice follows the principle of least surprise.