Open pintoa1-mskcc opened 1 year ago
So the issue was the header names. I'm assuming headers are NOT allowed in CFF format?
Thanks for looking into the issue, @pintoa1-mskcc ! It would appear that is correct -- none of the test data has a header. I'll dig into the codebase to look into the assumptions the script makes when parsing CFF files.
@mapostolides, I know it's been a while, but if you know the answer, that would be great! Otherwise, I'll (hopefully) find time later this week.
If I remember correctly, the CFF files don’t have headers. The parsing script could be modified to expect a header if the header is needed. The wiki explains the meaning of each field of the CFF. Hope that helps!
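As a rough illustration of the "expect a header" idea, here is a minimal Python sketch that drops a leading header row before the file is parsed. It is not taken from the MetaFusion code: the function names are hypothetical, the tab delimiter is an assumption, and the header is recognised only by the first two column names from the wiki (chr1, pos1). Checking the first field alone would not be enough, because chr1 is also a plausible chromosome value.

```python
# Hypothetical helper, not part of MetaFusion. Assumes tab-separated CFF lines.
HEADER_PREFIX = ["chr1", "pos1"]  # first two column names listed on the wiki


def looks_like_header(line):
    """Return True if the line starts with the wiki's first two column names."""
    fields = line.rstrip("\n").split("\t")
    return fields[:2] == HEADER_PREFIX


def strip_cff_header(lines):
    """Yield CFF lines, skipping the first line if it looks like a header."""
    for index, line in enumerate(lines):
        if index == 0 and looks_like_header(line):
            continue
        yield line


example = [
    "chr1\tpos1\tstrand1\n",   # header row
    "chr1\t12345\t+\n",        # data row on chromosome 1
]
cleaned = list(strip_cff_header(example))
```

With this input, only the data row is kept even though its first field is also chr1.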
Awesome, it's fine that it doesn't have headers. Could it be added to the documentation that MetaFusion does NOT expect any header/column names in the input CFF file? The wiki is a little misleading, as the CFF format tab shows the example with a header.
Hi all,
I looked into the issue a bit. It looks like line 82 of Metafusion.sh isn't checking carefully enough to make sure that only valid lines are captured. Using the column names in the CFF format, chr1 actually passes the regular expression /[0-9XY]/, because the unanchored pattern matches any value that contains a digit, an X, or a Y. In fact, alternate chromosome names (e.g. chr12 and 12) and non-existent names (e.g. 9S and 2B) can also pass.
I'll make a quick change to that to fix the initial problem and continue to look at the code to clarify the CFF format specifications.
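To make the matching behaviour concrete, here is a small Python sketch (the real check lives in the shell script, so this is only an analogy). LOOSE is the pattern quoted above. STRICT is a hypothetical stricter alternative, not the actual fix; it assumes chromosomes 1-22, X and Y with an optional chr prefix, and ignores mitochondrial or alternate contigs.

```python
import re

# The pattern quoted above: unanchored, so one matching character is enough.
LOOSE = re.compile(r"[0-9XY]")

# Hypothetical stricter pattern (an assumption, not the project's fix):
# optional "chr" prefix followed by 1-22, X, or Y, and nothing else.
STRICT = re.compile(r"(chr)?([1-9]|1[0-9]|2[0-2]|X|Y)")

candidates = ["chr1", "chr12", "12", "9S", "2B", "X"]

# search() succeeds if the pattern occurs anywhere in the string.
loose_hits = [name for name in candidates if LOOSE.search(name)]

# fullmatch() requires the whole string to match the pattern.
strict_hits = [name for name in candidates if STRICT.fullmatch(name)]
```

Here loose_hits contains every candidate, while strict_hits drops 9S and 2B. Note that the header name chr1 still passes the stricter pattern, since it is also a valid chromosome name, so rejecting a header row would need a separate check on another field.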
Michael
I attempted to use the convert-to-CFF helper script provided; however, the output does not match the expected format, and it appears the wiki is outdated on how to use the tool. The convert_cff helper script returns a bit of a mess: sample names are cut off, columns are merged incorrectly, and it doesn't have all the columns that are "mandatory" for CFF format (everything from t_gene1 onward seems to be missing).
I made my own script to exactly match the format on the wiki:
cff_format <- c("chr1","pos1","strand1","chr2","pos2","strand2","library","sample_name", "sample_type","disease","tool",'split_cnt',"span_cnt","t_gene1","t_area1", "t_gene2","t_area2")
However, when I try to run metafusion.sh, the "reformat" step changes my "strand1" and "strand2" columns to NA columns; then, when that output is passed on to the "renamed" step, I get EMPTY files. I also get a whole bunch of errors.
After the "reann" step, my CFF file is completely empty and MetaFusion runs on all the empty files. I have successfully run your test CFF files through MetaFusion, but I cannot get a real example working.
Would it be possible to update the wiki to explain the exact format of CFF, whether or not NAs are allowed, the data type of each column (int, string, etc.), and whether or not "disease" is important for analysis? At the moment we are putting NAs in the disease slot.
I'm assuming that I am NOT supposed to have a header in a CFF file and that the columns MUST be in the order I specified above?
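For anyone who wants to sanity-check a file against those two assumptions before running the pipeline, here is a minimal Python sketch. It only checks that no line is a header row and that every line has as many fields as the column list above. The function name is hypothetical, the tab delimiter is an assumption, and allowed values and data types are deliberately not checked because the thread has not confirmed them.

```python
# Column order copied from the list above; whether MetaFusion requires
# exactly this order is the open question in this thread.
CFF_COLUMNS = [
    "chr1", "pos1", "strand1", "chr2", "pos2", "strand2",
    "library", "sample_name", "sample_type", "disease", "tool",
    "split_cnt", "span_cnt", "t_gene1", "t_area1", "t_gene2", "t_area2",
]


def check_cff_line(line):
    """Return a list of problems found in one tab-separated CFF line.

    Hypothetical helper: checks only for a header row and the field count.
    """
    fields = line.rstrip("\n").split("\t")
    problems = []
    if fields == CFF_COLUMNS:
        problems.append("line is a header row")
    if len(fields) != len(CFF_COLUMNS):
        problems.append(
            f"expected {len(CFF_COLUMNS)} fields, found {len(fields)}"
        )
    return problems


header_line = "\t".join(CFF_COLUMNS) + "\n"
short_line = "chr1\t12345\t+\n"
header_problems = check_cff_line(header_line)
short_problems = check_cff_line(short_line)
```

An empty list means the line passed both checks; it says nothing about whether the values themselves are acceptable to MetaFusion.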