bgm-cwg / novoCaller

MIT License

Providing example files #1

Open cmarcuscy opened 5 years ago

cmarcuscy commented 5 years ago

Hi developers of novoCaller,

I have tried running the first layer of novoCaller with the following command, but the program just keeps running for over 24 hours without generating any output. I am new to bioinformatics, so please correct me if I have made any mistakes.

Command: novoCaller -I input.vcf -O step_1_out.txt -T sample_id.txt -X 1 -P 0.005 -E 0.008

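vcf: example.vcf.gz https://github.com/bgm-cwg/novoCaller/files/2732605/example.vcf.gz

sample ID file: sample_id.txt https://github.com/bgm-cwg/novoCaller/files/2732610/sample_id.txt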

It would be very helpful if you can provide example files for the program.

Thanks a lot!

Marcus

anwoy commented 5 years ago

Hi Marcus, Thank you for your question. The 'sample_id.txt' file should contain the sample names exactly as they appear in the vcf file. In the vcf file the sample names are AGG0030, AGG0031 and AGG0032, but in the 'sample_id.txt' file they are sample1, sample2 and sample3. novoCaller also needs unrelated control samples to be present, which the algorithm uses to judge the quality of the calls. The example vcf file contains only the three samples that make up the trio. Please try using a vcf file with a larger number of samples.

Best Regards, Anwoy
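A quick way to catch this kind of name mismatch before a long run is to compare the IDs in the sample file with the sample columns on the VCF's #CHROM header line. The sketch below is not part of novoCaller: the function names are hypothetical, and it assumes the ID file holds whitespace-separated sample names.

```python
import gzip
import io

def vcf_sample_names(lines):
    """Return the sample names from the '#CHROM' header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed (CHROM ... FORMAT); samples follow.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached the records without seeing a header line
    raise ValueError("no #CHROM header line found")

def missing_ids(id_text, sample_names):
    """IDs listed in the ID file that do not appear in the VCF header."""
    known = set(sample_names)
    return [s for s in id_text.split() if s not in known]

def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

# Toy header modelled on the sample names mentioned in this thread.
header = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAGG0030\tAGG0031\tAGG0032\n"
)
names = vcf_sample_names(header)
print(missing_ids("sample1 sample2 sample3", names))   # all three are unknown
print(missing_ids("AGG0030 AGG0031 AGG0032", names))   # nothing missing
```

For a real file, `vcf_sample_names(open_vcf("example.vcf.gz"))` would read the header directly; any non-empty result from `missing_ids` means the ID file and the VCF disagree.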


cmarcuscy commented 5 years ago

Hi Anwoy,

Thank you for your answers. Following your suggestions, I have included more samples (261 samples) in the run and made sure the sample names and sample IDs match. Still, the program has not generated any data after running for 2 days, and no error message has appeared. Do you have any suggestions on how I should troubleshoot?

Thanks a lot!

Regards, Marcus

anwoy commented 5 years ago

Can you please send me the vcf file and the samples.txt file?


cmarcuscy commented 5 years ago

Dear Anwoy,

Please find the vcf (first 1000 lines) and samples.txt files below. Thanks!

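novocaller_sample.vcf.gz https://github.com/bgm-cwg/novoCaller/files/2752795/novocaller_sample.vcf.gz

novoCaller_samples.txt https://github.com/bgm-cwg/novoCaller/files/2752794/novoCaller_samples.txt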

Marcus

anwoy commented 5 years ago

Thanks Marcus, I will get back to you soon.


anwoy commented 5 years ago

Hi Marcus, The caller was made to read the output of VEP (Variant Effect Predictor), which is stored in the INFO field under the key 'CSQ'. Since VEP was not run on the vcf file, the caller did not work. Thanks for finding this bug. I will make the caller give an error when it doesn't find the 'CSQ' key. You can try running VEP on the file and running the caller again.

Best Regards, Anwoy
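Until the caller reports this itself, a pre-flight check along these lines can tell whether a VCF carries VEP annotations. It is only a sketch, not novoCaller's own code: it relies on the header line VEP writes for CSQ, whose description ends in "Format: Allele|Consequence|...", and the function name and toy header are hypothetical. The ExAC_AF lookup is motivated by the CSQ_ExAC_AF_col value the caller prints in its log.

```python
import re

def csq_subfields(header_lines):
    """Return the '|'-separated CSQ sub-field names declared by VEP, or None if absent."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            match = re.search(r'Format: ([^">]+)', line)
            if match is None:
                raise ValueError("CSQ header line has no 'Format:' description")
            return match.group(1).strip().split("|")
    return None

# Toy headers: one with a (shortened) VEP CSQ declaration, one without.
annotated = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|ExAC_AF">',
]
plain = ["##fileformat=VCFv4.2"]

fields = csq_subfields(annotated)
print(fields.index("ExAC_AF"))   # position of the ExAC allele frequency within CSQ
print(csq_subfields(plain))      # None: VEP has not been run on this file
```

A `None` result is the situation described above: the file was never annotated, so it is better to stop and run VEP than to start the caller.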


cmarcuscy commented 5 years ago

Hi Anwoy,

Thank you for your work to fix the bug. I will try running novocaller after running VEP.

Regards, Marcus

cmarcuscy commented 5 years ago

Hi Anwoy, I have annotated the vcf with VEP and the program now runs successfully. However, I am getting some unexpected results.

infilename=/home/ramsar1971/project/asd/Reannotation/vep/ASD_276.recaliecalls_kggseq_samprm_vep.vcf
trio_ID_filename=/home/ramsar1971/project/asd/Reannotation/ASD88_Trio_novocaller.txt
outfilename=/home/ramsar1971/project/asd/Reannotation/vep/ASD_276_step1_out.txt
X_choice=1
PP_thresh=0.005
ExAC_thresh=0.008
vcf_line_cols:
0 1 2 3 4 5 6 7 8
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
total_candidates=261
end_col=260
number of parents = 258
number of children = 3
parent_cols= 3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102:103:104:105:106:107:108:109:110:111:112:113:114:115:116:117:118:119:120:121:122:123:124:125:126:127:128:129:130:131:132:133:134:135:136:137:138:139:140:141:142:143:144:145:146:147:148:149:150:151:152:153:154:155:156:157:158:159:160:161:162:163:164:165:166:167:168:169:170:171:172:173:174:175:176:177:178:179:180:181:182:183:184:185:186:187:188:189:190:191:192:193:194:195:196:197:198:199:200:201:202:203:204:205:206:207:208:209:210:211:212:213:214:215:216:217:218:219:220:221:222:223:224:225:226:227:228:229:230:231:232:233:234:235:236:237:238:239:240:241:242:243:244:245:246:247:248:249:250:251:252:253:254:255:256:257:258:259:260:
trio_set= 1:2:0:
CSQ_ExAC_AF_col=32

It seems that the program only recognizes three sets of trios among the 88 trios included. Another point to note is that the output only contains 1 candidate DN mutation:

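Do you have any idea? Thanks!

Input vcf: 1000_novocaller.vcf.gz https://github.com/bgm-cwg/novoCaller/files/2849782/1000_novocaller.vcf.gz

Input txt file: pedigree.txt https://github.com/bgm-cwg/novoCaller/files/2849783/pedigree.txt

Output file: novocaller_step1_out.txt https://github.com/bgm-cwg/novoCaller/files/2849784/novocaller_step1_out.txt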

Marcus

anwoy commented 5 years ago

Hi Marcus, Sorry for the late reply. Yes, the caller was made for a Mendelian diseases research team, which generally works on cases comprising one trio in which a de novo mutation is suspected, although the code can be modified to give output for all the trios. The expected number of de novo mutations per trio in the coding region (which is where the software looks) is around 1 to 3, so I would say 1 call is within the expected range. If you are interested in running the caller for a large-scale de novo study, the code will have to be modified slightly.

Best Regards, Anwoy
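For a cohort such as the 88 trios mentioned above, one workaround that needs no change to the caller is to run it once per trio against the same multi-sample VCF. The sketch below only builds the command lines, reusing the flags from the command at the top of this thread. The file names and the helper are hypothetical, it assumes one ID file per trio already exists in whatever layout novoCaller expects, and whether the remaining samples then serve as suitable unrelated controls has not been confirmed in this thread.

```python
from pathlib import Path

def build_commands(vcf, trio_id_files, out_dir,
                   x_choice=1, pp_thresh=0.005, exac_thresh=0.008):
    """Return one novoCaller argument list per trio ID file."""
    commands = []
    for id_file in trio_id_files:
        # Hypothetical naming scheme: <ID file stem>_step1_out.txt
        out_file = Path(out_dir) / (Path(id_file).stem + "_step1_out.txt")
        commands.append([
            "novoCaller",
            "-I", str(vcf),
            "-O", str(out_file),
            "-T", str(id_file),
            "-X", str(x_choice),
            "-P", str(pp_thresh),
            "-E", str(exac_thresh),
        ])
    return commands

cmds = build_commands("cohort_vep.vcf", ["trio_001.txt", "trio_002.txt"], "out")
for cmd in cmds:
    print(" ".join(cmd))
    # To actually launch a run: subprocess.run(cmd, check=True)
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if paths contain spaces.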


aojielian commented 5 years ago

Hi Anwoy,

I already have a VCF with CSQ annotations, i.e. VEP has been run on it. Here is my command to run novoCaller: "./novoCaller -I 11.vcf -O SSC02220.txt -T trio_ids.txt -X 1 -P 0.5 -E 0.008"

The trio_ids.txt looks like "SSC02220 SSC02219 SSC02217 "

The 11.vcf is a quad VCF, i.e. it contains 4 individuals. Can novoCaller work on quad VCFs, or is something wrong with my command line?

Sorry to ask you so many trivial questions.

Best Regards,

Aojie

ghost commented 5 years ago

Hi Anwoy,

I am perplexed about unrelated control samples. Are the unrelated samples those with a normal phenotype, those with other diseases, or different samples that have the same phenotype?

I am new to bioinformatics. There's so much that I don't understand. Sorry to ask you so many trivial questions

Thanks a lot! Liangdy

anwoy commented 5 years ago

Hi Liangdy, the unrelated samples can be samples with normal phenotype, or samples with other diseases.

Best Regards, Anwoy


anwoy commented 5 years ago

The unrelated samples must also not be related to the proband (cousins, uncles, aunts etc. of the proband are not preferred).


ghost commented 5 years ago

Hi Anwoy,

Thank you for your answers.

If we merge multiple vcf files with vcftools or bcftools, the unrelated sample information in the merged file may look as follows:

CHR  POS  ...  AGG0002                  AGG0003                  AGG0001
Q    X    ...  1/0:10,0:10:27:0,27,405  .:.:.:.:.                .:.:.:.:.

AGG0003 and AGG0001 lose information such as DP, PQ and so on.

When merging at the BAM level with GATK, the information above is preserved, but the computational cost is obviously higher.

CHR  POS  ...  AGG0002                  AGG0003                  AGG0001
Q    X    ...  1/0:10,0:10:27:0,27,405  2/2:10,0:10:27:0,27,405  3/3:12,0:12:30:0,30,450

Which approach is more suitable for DNM calling, in order to maximize accuracy and eliminate false negatives? Or do these adjustments have almost no effect on the final result?

Sorry to ask you so many trivial questions just like before

Thanks a lot! Liangdy

anwoy commented 5 years ago

The AD information (allele depth) is needed in as many unrelated samples as possible as that information is used to judge the quality of the de-novo call.
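One way to see how much allele-depth information survives a merge is to count, for each record, how many samples still carry a non-missing AD value. This is a minimal sketch with a hypothetical helper name. It assumes a standard tab-separated VCF record whose FORMAT column lists AD, and the FORMAT string in the toy record is an assumption, since the example in the question does not show it.

```python
def ad_coverage(record_line):
    """Return (samples with a usable AD value, total samples) for one VCF record."""
    cols = record_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")          # FORMAT column, e.g. GT:AD:DP:GQ:PL
    total = len(cols) - 9
    if "AD" not in keys:
        return 0, total
    ad_index = keys.index("AD")
    with_ad = 0
    for sample in cols[9:]:
        values = sample.split(":")
        # Trailing fields may be dropped, and missing values are written as '.'.
        if ad_index < len(values) and "." not in values[ad_index].split(","):
            with_ad += 1
    return with_ad, total

# Toy record: one genotyped sample and two samples left missing by the merge.
record = "\t".join([
    "1", "12345", ".", "A", "G", "50", "PASS", ".", "GT:AD:DP:GQ:PL",
    "0/1:10,0:10:27:0,27,405", ".:.:.:.:.", ".:.:.:.:.",
])
print(ad_coverage(record))
```

Records where most unrelated samples have no AD give the caller little to judge the call against, so a low ratio across the file suggests the merge has discarded the information it needs.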


olenamarchenko1234 commented 1 year ago

@anwoy Thank you for the tips! Can you provide example runtimes for an exome trio and for a whole-genome trio? Can it be scaled to run on a pVCF with 50K samples?